SUPPLEMENT TO 



ISSN 0262-4028 


British 

Telecommunications 

Engineering 

Vol. 7 Part 3 October 1988 


RELIABILITY ENGINEERING 

DAVID J. SMITH†


† Consultant


© 1988: DJS 



PREFACE 


This text has been written to provide an introduction to the need for the use of reliability 
engineering techniques during the design, manufacture and use of equipment in the 
telecommunications industry. A fuller explanation of each of the areas covered is
available in Reliability and Maintainability in Perspective, David J. Smith, third edition,
1988, Macmillan Education.

[Editor's Note: This paper has been written by an external consultant and deals with
reliability in general terms. Detailed advice and assistance on reliability assurance or 
reliability in general can be obtained from Materials Services, Quality Assurance Division 
(01-226 1262 Ext. 213), or Materials and Components Centre (021-771 1001 Ext. 6796)]. 

CONTENTS 


RELIABILITY AND COST EFFECTIVENESS

An introduction to the use of the failure rate concept as a design parameter with
cost-related implications.

TERMS AND JARGON

An explanation of the various terms used in reliability and maintainability.

INTERRELATIONSHIP OF PARAMETERS

The basic reliability equations necessary for a simple understanding of the subject.

ACHIEVING RELIABILITY IN DESIGN

An outline of the design methods which can achieve specific reliability objectives.

ACHIEVING RELIABILITY IN USE

The methods of minimising failures during manufacture and operation.

DATA INTERPRETATION AND TEST

The need for failure data and the methods of deducing failure rates from field data.

RELIABILITY ASSESSMENT AND MODELLING

How to predict system reliability taking account of equipment redundancy and
common-mode failures.

FAILURE RATE DATA SOURCES

A description of available failure rate data sources and the pitfalls involved.

MAINTAINABILITY

Features in design and maintenance which influence the repair and down times of
equipment.

SOFTWARE RELIABILITY AND QUALITY

The features of software-related failure and the defences and tools available.

MANAGEMENT OF RELIABILITY

Organisation and standards.

BIBLIOGRAPHY

A brief list of standards, guidelines and literature for further reading.




Supplement to Br. Telecommun. Eng., Vol. 7, Oct. 1988 



RELIABILITY AND COST EFFECTIVENESS 


With the continuing increase in demand for communications equipment, which has 
resulted in larger and more complex telecommunications systems, there has been a focus 
on the need to specify and measure the levels of reliability involved. This has resulted 
from the high cost of failures both in terms of repair costs and the costs of lost revenue 
from faulty circuits. 

The practice of identifying appropriate reliability requirements during specification and 
prior to design is now common. This leads to the use of reliability prediction/assessment 
techniques during the design cycle as a means of reviewing the progress of the design 
towards meeting these requirements. In turn, the reliability assessment process generates 
the need for field failure rate data in order to carry out the predictions. 

These failure rate collection activities and reliability assessment techniques make up 
the practice of reliability engineering and this text explains the various methods used and 
the pitfalls associated with them. 

It should not be assumed that reliability levels are chosen at random or that high 
reliability is attempted purely for its own sake. Like all other engineering parameters, an 
optimum value exists for each situation which balances the cost of achieving high reliability 
with the penalties which arise from failures. It is thus necessary to justify the cost of high 
reliability against the savings in maintenance and revenue which ensue. In general,
reliability enhancements are justified, and an improvement seldom costs
more than the resultant savings.

This concept is illustrated in Figure 1, which shows the relationship between reliability 
and cost. The manufacturing costs of design, qualification, procurement, screening and 
test all increase with reliability. On the other hand, the field costs, which include 
maintenance, spares, warranty, redesign and loss of revenue, all decrease as reliability 
improves. The total cost is shown by the curve in Figure 1, which indicates a minimum 
cost, optimum reliability level. 


Figure 1 



These curves illustrate the philosophy of optimum cost reliability whereby each enhancement
or objective should be evaluated in terms of the cost of achievement versus the
savings which will ensue. It would never be possible to plot absolute curves for any product 
because of the impossibility of separating each and every failure-related and reliability- 
related cost. 

It should be noted that the shape of the curves indicates that the cost of reliability 
continues to be justified well over to the right-hand side of the diagram. 

Consider the following simple example: 

A switching control system consists of a duplicated programmable control. Reliability 
assessment techniques predict that the particular mode of failure which results in a 
spurious shutdown of the system is likely to occur once a year. It is suggested that a 
triplicated voted control arrangement would improve the failure rate by an order of 
magnitude to 0.1 shutdowns per year. The cost of the additional processor and additional
design work is estimated to be £80 000. It is known that a shutdown normally involves 2 
hours loss of revenue from equipment which controls 1000 channels each earning £20 per 
hour. 

The potential savings are: 

0.9 (saving in failures per year) × 2 hours × 1000 channels × £20 per hour = £36 000 per year.

The cost of the reliability enhancement is thus just over 2 years' savings and is therefore
probably justified.
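
The arithmetic of this example can be checked with a short sketch. The figures are those quoted above; the function and variable names are illustrative only, and a simple undiscounted payback period is assumed.

```python
def annual_saving(rate_before, rate_after, outage_hours, channels, revenue_per_channel_hour):
    """Revenue saved per year by reducing the shutdown rate (shutdowns per year)."""
    avoided_shutdowns = rate_before - rate_after
    return avoided_shutdowns * outage_hours * channels * revenue_per_channel_hour


# Duplicated control: 1 shutdown per year; triplicated voted control: 0.1 per year.
saving = annual_saving(1.0, 0.1, outage_hours=2, channels=1000, revenue_per_channel_hour=20)

# Cost of the additional processor and design work, divided by the yearly saving.
payback_years = 80_000 / saving

print(f"Annual saving: £{saving:,.0f}")
print(f"Payback period: {payback_years:.1f} years")
```

The same function can be reused to compare other candidate enhancements, provided that the cost of each outage can be estimated.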








TERMS AND JARGON 


It cannot be emphasised strongly enough that what we perceive to be the reliability of an 
item depends upon how the word failure has been defined. If a motor car is said to fail 
when it does not start, then the malfunction of a number of component parts will contribute 
to ‘system failure’. If, on the other hand, failure is redefined as inability to come to a halt, 
then different parts are involved and a different numerical value of reliability will be 
obtained. 

Before any further discussion of reliability parameters, it is therefore essential to address 
the problem of defining failure. Unless the failed state is described, then it is impossible 
to specify values of reliability. The only definition of failure is: 

NON-CONFORMANCE TO SOME DEFINED PERFORMANCE CRITERION 

Refinements which differentiate between such terms as defect, malfunction, failure and 
reject are important in contract clauses and in the classification and analysis of data but 
should not be allowed to cloud the above explanation. 

Consider the example of two diodes in Figure 2. 

In order to address the reliability of this circuit, it is likely that one would enquire as
to the failure rate of the individual diodes. Failure rate is, of course, one particular
parameter describing reliability.

This might be 0.1 failures per million hours. One might infer therefore that our small
system has a total failure rate of 0.2 per million hours. Life, however, is not quite so
simple.

Figure 2

Suppose that the main concern is the loss of electrical continuity in which case the total 
failure rate could be expected to be greater than for a single diode, owing to the series 
nature of the configuration. In fact, it is double the failure rate of a single diode. Since
we have been specific about the definition of failure, a question arises concerning the 0.1
per million hours.

Are they all failures which involve an open-circuit condition? 

Presumably, the 0.1 per million hours involves all types of failures, including
short-circuit conditions, and only a subset, say 0.025 per million hours, is relevant.

Suppose, on the other hand, that we are concerned with the short-circuit condition 
within this circuit. The situation changes dramatically. Firstly, the fact that there are two 
diodes would appear to enhance rather than reduce the reliability since for this new failure 
mode they both need to fail. Secondly, the failure mode of interest at the component level 
is short circuit. A different calculation is now needed and this will be addressed later on. 

Table 1 shows a hypothetical breakdown of the failure rates for a junction diode. 

TABLE 1

Failure Mode | Failure Rate (per Million Hours)
Open circuit | 0.025
Short circuit | 0.015
High reverse | 0.060


The essential point is that the definition of system failure totally determines the system 
reliability and dictates the failure mode data required at the component level. The above 
example demonstrates the problem in a simple way, but in the analysis of complex 
equipment the effect is more subtle. 
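
The open-circuit case discussed above can be sketched as follows. The rates are the hypothetical values of Table 1; the dictionary and function names are illustrative, and the sketch covers only a failure mode in which the failure of any one series item fails the system.

```python
# Hypothetical failure-mode rates for one junction diode (Table 1), per million hours.
DIODE_MODES = {
    "open circuit": 0.025,
    "short circuit": 0.015,
    "high reverse": 0.060,
}


def series_rate(mode, n_items=2):
    """Rate of a system failure that occurs if ANY one of n series items fails in `mode`."""
    return n_items * DIODE_MODES[mode]


total_per_diode = sum(DIODE_MODES.values())
open_circuit_system = series_rate("open circuit")

print(f"All modes, one diode: {total_per_diode:.3f} per million hours")
print(f"Loss of continuity, two diodes in series: {open_circuit_system:.3f} per million hours")
```

The short-circuit case is deliberately left out, since both diodes must then fail and a simple sum of rates no longer applies.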

Given then that the word ‘failure’ is specifically defined for a given application, quality,
reliability and maintainability can be defined as follows:

QUALITY            Conformance to specification.

RELIABILITY        The probability that an item will perform a required function,
                   under stated conditions, for a stated period of time.
                   Reliability is therefore the extension of quality into the time
                   domain and may be paraphrased as ‘the probability of
                   non-failure in a given period’.

MAINTAINABILITY    The probability that a failed item will be restored to operational
                   effectiveness within a given period of time when the
                   repair action is performed in accordance with prescribed
                   procedures. This, in turn, can be paraphrased as ‘the
                   probability of repair in a given time’.

Requirements are seldom expressed by stating values of reliability or of maintainability. 
There are useful related parameters such as failure rate, mean time between failures and 
mean time to repair which more easily describe them. Figure 3 provides a model against 
which failure rate can be defined. 



Figure 3 


The symbol for failure rate is $\lambda$ (lambda). Assume a batch of $N$ items and that at any
time $t$ a number $k$ have failed. The cumulative time, $T$, will be $Nt$ if it is assumed that
each failed item is replaced when it fails. In a non-replacement case, $T$ is given by:

$$T = t_1 + t_2 + t_3 + \dots + t_k + (N - k)t,$$

where $t_1$ is the occurrence of the first failure, etc.

The Observed Failure Rate  This is defined: For a stated period in the life of an item,
the ratio of the total number of failures to the total cumulative observed time. If $\lambda$ is the
failure rate of the $N$ items, then the observed failure rate is given by $\hat{\lambda} = k/T$. The ^ (hat) symbol
is very important since it indicates that $k/T$ is only an estimate of $\lambda$. The true value will
be revealed only when all $N$ items have failed. Making inferences about $\lambda$ from values of
$k$ and $T$ will be addressed later. It should also be noted that the value of $\hat{\lambda}$ is the average
over the period in question. The same value could be observed from increasing, constant
and decreasing failure rates. This is analogous to the case of a motor car whose speed
between two points is calculated as the ratio of distance to time, although the velocity
may have varied during this interval.
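
The two ways of accumulating the observed time, and the resulting point estimate, can be sketched as below. The batch size, test duration and failure times are invented for illustration, and the function names are not taken from the text.

```python
def cumulative_time(n_items, t_end, failure_times, replaced):
    """Total observed item-hours T for n_items observed from 0 to t_end."""
    if replaced:
        # Each failed item is replaced at once, so N items are always running.
        return n_items * t_end
    k = len(failure_times)
    # Failed items contribute only up to their failure time; survivors run to t_end.
    return sum(failure_times) + (n_items - k) * t_end


def observed_failure_rate(k, total_time):
    """Point estimate lambda-hat = k / T (an average over the period observed)."""
    return k / total_time


failures = [400.0, 1100.0, 1900.0]  # hypothetical failure times in hours
T = cumulative_time(50, 2000.0, failures, replaced=False)
lam_hat = observed_failure_rate(len(failures), T)

print(f"T = {T:.0f} item-hours, observed failure rate = {lam_hat:.2e} per hour")
```

With replacement, the same batch would accumulate 50 × 2000 = 100 000 item-hours, which gives a slightly lower estimate for the same number of failures.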

Failure rate, which has the unit of $t^{-1}$, is sometimes expressed as a percentage per
1000 h and sometimes as a number multiplied by a negative power of ten. Examples,
having the same value, are:

8500 per $10^9$ hours (8500 FITS)

8.5 per $10^6$ hours

0.85 per cent per 1000 hours

0.074 per year

Notice that these examples each have only two significant figures. It is seldom justified 
to exceed this level of accuracy, particularly if failure rates are being used to carry out a 
reliability prediction. 
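
The equivalence of these four expressions can be confirmed with a brief conversion sketch. A year of 8760 hours is assumed, and the names used are illustrative.

```python
HOURS_PER_YEAR = 8760  # 365 days x 24 hours; leap years ignored


def from_fits(fits):
    """Convert FITs (failures per 10^9 hours) to failures per hour."""
    return fits * 1e-9


rate = from_fits(8500)

per_million_hours = rate * 1e6
percent_per_1000h = rate * 1000 * 100
per_year = rate * HOURS_PER_YEAR

# Two significant figures, in keeping with the accuracy that is normally justified.
print(f"{per_million_hours:.2g} per 10^6 hours")
print(f"{percent_per_1000h:.2g} per cent per 1000 hours")
print(f"{per_year:.2g} per year")
```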

The Observed Mean Time Between Failures (MTBF)  This is defined: For a stated
period in the life of an item, the mean value of the length of time between consecutive
failures, computed as the ratio of the total cumulative observed time to the total number
of failures. If $\theta$ (theta) is the MTBF of the $N$ items, then the observed MTBF is given
by $\hat{\theta} = T/k$. Once again, the hat symbol indicates a point estimate and the foregoing
remarks apply. The use of $T/k$ and $k/T$ to define $\hat{\theta}$ and $\hat{\lambda}$ leads to the inference that
$\hat{\theta} = 1/\hat{\lambda}$. This equality must be treated with caution since it is inappropriate to compute failure
rate unless it is constant.

The Observed Mean Time to Fail (MTTF) This is defined: For a stated period in the 
life of an item, the ratio of cumulative time to the total number of failures. Again this is 
T/k. The only difference between MTBF and MTTF is in their usage. MTTF is applied 
to items that are not repaired, such as bearings and transistors, and MTBF to items which 
are repaired. It must be remembered that the time between failures excludes the down 
time. MTBF is therefore mean up time between failures. In Figure 4, it is the average of
the values of (t).


Figure 4 



Mean Life This is defined as the mean of the times to failure where each item is 
allowed to fail. This is often confused with MTBF and MTTF. It is important to understand 
the difference. MTBF and MTTF can be calculated over any period as, for example, 
confined to the constant failure rate portion of the bathtub curve (see later). Mean life 
on the other hand must include the failure of every item and therefore takes into account 
the wear-out end of the curve. Only for constant failure rate situations are they the same. 

Bathtub Distribution 

The much used bathtub curve is an example of the practice of treating more than one 
failure type by a single classification. It seeks to describe the variation of failure rate of
electrical components during their life. Figure 5 shows this generalised relationship as it
is assumed to apply to electronic components. The failures exhibited in the first part of 
the curve, where failure rate is decreasing, are called early failures or infant mortality 
failures. The middle portion is referred to as the useful life and it is assumed that failures 
exhibit a constant failure rate; that is to say they occur at random. The latter part of the 
curve describes the wear-out failures and it is assumed that failure rate increases as the 
wear-out mechanisms accelerate. 

Figure 5 shows the bathtub curve to be the sum of three separate overlapping failure 
distributions. Labelling sections of the curve as wear out, burn in and random can now
be seen in a different light. The wear-out region implies only that wear-out failure 
predominates, namely that such a failure is more likely than the other types. The three 
distributions are described in Table 2. 


Figure 5 
Bathtub curve 



TABLE 2

Distribution | Known as | Comments
Decreasing failure rate | Infant mortality; burn-in; early failures | Usually related to manufacture and quality assurance; for example, welds, joints, connections, wraps, dirt, impurities, cracks, insulation or coating flaws, incorrect adjustment or positioning. In other words, populations of substandard items owing to microscopic flaws.
Constant failure rate | Random failures; useful life; stress-related failures; stochastic failures | Usually assumed to be stress-related failures; that is to say, random fluctuations (transients) of stress exceeding the component strength. The design reliability is of this type. It is thought that inherent substandard items contribute to this category.
Increasing failure rate | Wearout failures | Owing to corrosion, oxidation, breakdown of insulation, atomic migration, friction wear, shrinkage, fatigue, etc.
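
The idea that the bathtub curve is the sum of three overlapping distributions can be illustrated numerically. The sketch below assumes, purely for illustration, that each distribution has a Weibull failure rate; neither the Weibull form nor the parameter values are taken from the text.

```python
def weibull_hazard(t, shape, scale):
    """Weibull failure rate: decreasing for shape < 1, constant for 1, increasing for > 1.

    t must be greater than zero when shape < 1.
    """
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_rate(t):
    """Overall failure rate as the sum of three hypothetical contributions."""
    early = weibull_hazard(t, shape=0.5, scale=2.0e6)     # decreasing: early failures
    random_ = weibull_hazard(t, shape=1.0, scale=1.0e5)   # constant: random failures
    wear_out = weibull_hazard(t, shape=4.0, scale=1.0e5)  # increasing: wear out
    return early + random_ + wear_out


for hours in (10, 100, 1_000, 10_000, 50_000, 100_000, 150_000):
    print(f"{hours:>7} h  {bathtub_rate(hours):.2e} per h")
```

All three contributions are present at every age; the label attached to a region of the curve only indicates which of them predominates there.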


Down Time and Repair Time 

It is now necessary to consider mean down time and mean time to repair (MDT, MTTR). 
There is frequently confusion between the two and it is therefore important to understand 
the difference. Down time, or outage, is the period during which equipment is in the failed 
state. A formal definition is usually avoided, owing to the difficulties of generalising about 
a parameter which may consist of different elements according to the system and its 
operating conditions. Consider the following examples which emphasise the problem: 

(a) A system not in continuous use may develop a fault while it is idle. The fault 
condition may not become evident until the system is required for operation. Is down time 
to be measured from the incidence of a fault, from the start of an alarm condition, or 
from the time the system would have been required? 


(b) In some cases, it may be economical or essential to leave an equipment in a faulty 
condition until a particular moment or until several similar failures have accrued. 

(c) Repair may have been completed but it may not be safe to restore the system to 
its operating condition immediately. Alternatively, owing to a cyclic type of situation it 
may be necessary to delay. When does down time cease under these circumstances? 

It is necessary, as can be seen from the above, to define down time as required for each 
system under given operating conditions and maintenance arrangements. MTTR and 
MDT, although overlapping, are not identical. Down time may commence before repair 
as in (a) above. Repair often involves an element of checkout or alignment which may
extend beyond the outage. The definition and use of these terms will depend on whether 
availability or the maintenance resources are being considered. 

The significance of these terms is not always the same depending upon whether a 
system, a replicated unit or a replaceable module is being considered. Table 3 compares 
the effect of down time and repair time at different levels. 


TABLE 3

Level | Down Time | Repair Time
System | Determines availability and hence cost of lost revenue | Contributes to maintenance cost
Redundant unit | Determines system availability | Contributes to maintenance cost
Replaceable module | Influences spares level | Influences maintenance cost and availability of module as a spare part


Figure 6 shows the elements of down time and repair time.


Figure 6
Elements of down time and repair time

[Diagram: elements (a) to (i) of down time and repair time, comprising realisation, access, diagnosis, spares, replace, check, logistic time and administrative time; some activities may occur several times and in no specific sequence]


(a) Realisation Time This is the time which elapses before the fault condition becomes 
apparent. This element is pertinent to availability but does not constitute part of the repair 
time. 

(b) Access Time  This is the time, from realisation that a fault exists, taken to make
contact with displays and test points and so commence fault finding. This does not include
travel but does include the removal of covers and shields and the connection of test equipment. This
is determined largely by mechanical design.

(c) Diagnosis Time  This is referred to as fault finding and includes adjustment of
test equipment (for example, setting up an oscilloscope or generator), carrying out checks
(for example, examining waveforms for comparison with a handbook), interpretation of
information gained (this may be aided by algorithms), verifying the conclusions drawn
and deciding upon the corrective action.

(d) Spare Part Procurement  Part procurement can be from the ‘tool box’, by cannibalisation
or by taking a redundant identical assembly from some other part of the system.
The time taken to move parts from a depot or store to the system is not included, being
part of the logistic time.


(e) Replacement Time  This involves removal of the faulty least replaceable assembly
(LRA) followed by connection and wiring, as appropriate, of a replacement. The LRA is 
the replaceable item beyond which fault diagnosis does not continue. Replacement time 
is largely dependent on the choice of LRA and on mechanical design features such as the 
choice of connectors. 

(f) Checkout Time  This involves verifying that the fault condition no longer exists
and that the system is operational. It may be possible to restore the system to operation 
before completing the checkout in which case, although a repair activity, it does not all 
constitute down time. 

(g) Alignment Time As a result of inserting a new module into the system, adjustments 
may be required. As in the case of checkout, some or all of the alignment may fall outside 
the down time. 

(h) Logistic Time  This is the time consumed waiting for spares, test gear, additional
tools and manpower to be transported to the system. 

(i) Administrative Time  This is a function of the system user’s organisation. Typical
activities involve failure reporting (where this affects down time), allocation of repair 
tasks, manpower changeover due to demarcation arrangements, official breaks, disputes 
etc. 

Activities (b)-(g) are called active repair elements and activities (h) and (i) passive
repair activities. Realisation time is not a repair activity but may be included in the 
MTTR where down time is the consideration. Checkout and alignment, although utilising 
manpower, can fall outside the down time. The active repair elements are determined by 
design, maintenance arrangements, environment, manpower, instructions, tools and test 
equipment. Logistic and administrative time is mainly determined by the maintenance 
environment; that is to say, the location of spares, equipment and manpower and the 
procedures for allocating tasks. 

Another parameter related to outage is repair rate ($\mu$). It is simply the down time
expressed as a rate; therefore:

$$\mu = 1/\mathrm{MTTR}.$$
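
The grouping of elements into active and passive repair time, and the conversion of a mean repair time into a repair rate, can be sketched as follows. The durations are invented, the names are illustrative, and taking MTTR as the mean of the active repair times is an assumption made only for this sketch; in practice the definition must be agreed for each system.

```python
# Elements (b) to (g) are active repair elements; (h) and (i) are passive.
ACTIVE = ("access", "diagnosis", "spares", "replacement", "checkout", "alignment")
PASSIVE = ("logistic", "administrative")


def summarise(action):
    """Return (active hours, passive hours) for one repair action."""
    active = sum(action.get(name, 0.0) for name in ACTIVE)
    passive = sum(action.get(name, 0.0) for name in PASSIVE)
    return active, passive


# Hypothetical records of two repair actions, in hours per element.
actions = [
    {"realisation": 0.5, "access": 0.2, "diagnosis": 1.0, "spares": 0.1,
     "replacement": 0.3, "checkout": 0.2, "logistic": 2.0},
    {"access": 0.1, "diagnosis": 0.4, "replacement": 0.2, "checkout": 0.1,
     "alignment": 0.2, "administrative": 1.0},
]

active_times = [summarise(a)[0] for a in actions]
mttr = sum(active_times) / len(active_times)  # assumed: mean active repair time
repair_rate = 1 / mttr                        # repairs per hour

print(f"MTTR = {mttr:.2f} h, repair rate = {repair_rate:.2f} per h")
```

Realisation time is recorded but counted in neither group, reflecting its status as a contributor to down time rather than a repair activity.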

Equipment and components are often described by means of the word ‘life’. This must 
not be confused with MTBF or MTTF. It refers to the onset of wear out and not to the 
level of reliability during the useful life. 


INTERRELATIONSHIPS OF PARAMETERS 

Returning to the model in Figure 3, consider the probability of an item failing in the
interval between $t$ and $t + dt$. This can be described in two ways:

(a) The probability of failure in the interval $t$ to $t + dt$ given that it has survived until
time $t$. This is

$$\lambda(t)\,dt,$$

where $\lambda(t)$ is the failure rate.

(b) The probability of failure in the interval $t$ to $t + dt$ unconditionally. This is

$$f(t)\,dt,$$

where $f(t)$ is the failure probability density function.

The probability of survival to time $t$ is defined as the reliability, $R(t)$. The rule of
conditional probability therefore dictates that:

$$\lambda(t)\,dt = \frac{f(t)\,dt}{R(t)}.$$

Therefore

$$\lambda(t) = \frac{f(t)}{R(t)}. \qquad (1)$$

However, if $f(t)\,dt$ is the probability of failure in $dt$ then:

$$\int_0^t f(t)\,dt = \text{probability of failure 0 to } t = 1 - R(t).$$

Differentiating both sides:

$$f(t) = -\frac{dR(t)}{dt}. \qquad (2)$$

Substituting equation (2) into equation (1):

$$-\lambda(t) = \frac{dR(t)}{dt} \times \frac{1}{R(t)}.$$


Therefore integrating both sides:

$$-\int_0^t \lambda(t)\,dt = \int_1^{R(t)} \frac{dR(t)}{R(t)}.$$

A word of explanation concerning the limits of integration is required. $\lambda(t)$ is integrated
with respect to time from 0 to $t$. $1/R(t)$ is, however, being integrated with respect to $R(t)$.
Now when $t = 0$, $R(t) = 1$ and at $t$, the reliability $R(t)$ is, by definition, $R(t)$. Integrating
then:

$$-\int_0^t \lambda(t)\,dt = \Big[\log_e R(t)\Big]_1^{R(t)} = \log_e R(t) - \log_e 1 = \log_e R(t).$$

But if $a = e^b$ then $b = \log_e a$, so that:

$$R(t) = \exp\left[-\int_0^t \lambda(t)\,dt\right]. \qquad (3)$$

If failure rate is now assumed to be constant:

$$R(t) = \exp\left[-\int_0^t \lambda\,dt\right] = \exp\Big[-\lambda t\,\Big|_0^t\Big].$$

Therefore,

$$R(t) = e^{-\lambda t}. \qquad (4)$$

In order to find the MTBF, let $N - k$, the number surviving at $t$, be $N_s(t)$. Then,
$R(t) = N_s(t)/N$.

In each interval $dt$, the time accumulated will be $N_s(t)\,dt$.

At $\infty$ the total will be $\int_0^\infty N_s(t)\,dt$.
Hence the MTBF will be given by:

$$\theta = \int_0^\infty \frac{N_s(t)}{N}\,dt = \int_0^\infty R(t)\,dt. \qquad (5)$$


This is the general expression for MTBF and always holds. In the special case of
$R(t) = e^{-\lambda t}$, then

$$\theta = \int_0^\infty e^{-\lambda t}\,dt,$$

$$\theta = \frac{1}{\lambda}. \qquad (6)$$


Note that inverting failure rate to obtain MTBF, and vice versa, is valid only for the 
constant failure rate case. 
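
The general expressions for reliability and MTBF can be checked numerically for the constant failure rate case. The sketch below uses simple trapezium-rule integration; the failure rate, the step counts and the truncation of the infinite integral at 30 times the MTBF are arbitrary choices made for illustration.

```python
import math


def reliability(hazard, t, steps=10_000):
    """R(t) = exp(-integral of the failure rate from 0 to t), by the trapezium rule."""
    dt = t / steps
    area = sum(0.5 * (hazard(i * dt) + hazard((i + 1) * dt)) * dt for i in range(steps))
    return math.exp(-area)


def mtbf(reliability_fn, horizon, steps=20_000):
    """theta = integral of R(t) from 0 to infinity, truncated at `horizon`."""
    dt = horizon / steps
    return sum(
        0.5 * (reliability_fn(i * dt) + reliability_fn((i + 1) * dt)) * dt
        for i in range(steps)
    )


lam = 8.5e-6  # constant failure rate per hour (illustrative value)

r_one_year = reliability(lambda t: lam, 8760.0)
theta = mtbf(lambda t: math.exp(-lam * t), horizon=30 / lam)

print(f"R(1 year) = {r_one_year:.4f}")
print(f"MTBF = {theta:.0f} h, 1/lambda = {1 / lam:.0f} h")
```

Passing a time-dependent failure rate to `reliability` gives the general case, for which the MTBF obtained by integration need not equal the reciprocal of any single failure rate value.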

A useful parameter is availability, which describes the overall amount of available 
uptime. It is determined by both the reliability and the maintainability of the item. 
Availability is, therefore: 

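$$
\begin{aligned}
A &= \frac{\text{up time}}{\text{total time}}, \\
  &= \frac{\text{up time}}{\text{up time} + \text{down time}}, \\
  &= \frac{\text{average of } (t)}{\text{average of } (t) + \text{mean down time}}, \\
  &= \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MDT}}.
\end{aligned}
$$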

This is known as the steady-state availability and can be expressed as a ratio or as a 
percentage. 

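Sometimes it is more convenient to use unavailability:

$$\bar{A} = 1 - A = \frac{\lambda\,\mathrm{MDT}}{1 + \lambda\,\mathrm{MDT}} \approx \lambda\,\mathrm{MDT},$$

where MTBF has been written as $1/\lambda$ and the approximation holds when $\lambda\,\mathrm{MDT}$ is much less than 1.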
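
A short sketch shows how closely the approximation follows the exact expression for typical values. The failure rate and mean down time are illustrative, and a constant failure rate is assumed so that MTBF may be taken as the reciprocal of the failure rate.

```python
def availability(mtbf_hours, mdt_hours):
    """Steady-state availability: MTBF / (MTBF + MDT)."""
    return mtbf_hours / (mtbf_hours + mdt_hours)


def unavailability(failure_rate, mdt_hours):
    """Exact steady-state unavailability for a constant failure rate."""
    x = failure_rate * mdt_hours
    return x / (1 + x)


lam = 8.5e-6  # failures per hour (illustrative)
mdt = 24.0    # mean down time in hours (illustrative)

a = availability(1 / lam, mdt)
u = unavailability(lam, mdt)
approx = lam * mdt

print(f"Availability   = {a:.6f}")
print(f"Unavailability = {u:.3e} (exact), {approx:.3e} (approximate)")
```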


Choosing the Appropriate Parameter 

It is clear that there are many parameters for describing the reliability and maintainability 
characteristics of an item. In any particular instance, there is likely to be one parameter 
more appropriate than the others. Although there are no hard and fast rules, the following 
guidelines may be of some assistance: 

Failure Rate           Only relevant when constant. Applicable to electronic components
                       and some mechanical items.

MTBF and MTTF          Often used to describe equipment or system reliability. Of use
                       when calculating maintenance costs.

Reliability            Used where the probability of failure is of interest as, for
                       example, in aircraft landings where safety is the prime
                       consideration.

Maintainability        Seldom used as such.

Mean Time to Repair    Often expressed in percentile terms such as, the 95 percentile
                       repair time shall be 1 hour. This means that only 5% of the
                       repair actions shall exceed 1 hour.

Mean Down Time         Used where the outage affects system reliability or availability.
                       Often expressed in percentile terms.

Availability           Very useful where there is a high cost of lost revenue, owing
                       to outage. Combines reliability and maintainability. Ideal for
                       describing process plant.

Mean Life              Beware of the confusion between MTTF and mean life.
                       Whereas the mean life describes the average life of an item
                       taking into account wear out, the MTTF is the average time
                       between failures. The difference is clear if one considers the
                       simple example of a match. It burns only for a few seconds and
                       thus has a short mean life. Few matches fail to light, however,
                       so the MTTF is high.

There are sources of standard definitions such as:

British Standard BS 4200 (Part 1)
IEC Publication 271
US MIL STD 721B
UK Defence Standard 00-5 (Part 1)

ACHIEVING RELIABILITY IN DESIGN

Reference is often made to the spectacular reliability of many nineteenth-century engineering
feats. Telford and Brunel indeed left a heritage of longstanding edifices such as
the Menai and Clifton bridges. Fame is secured by their continued existence but little is
remembered of the failures of that age. If, however, we concentrate on the success and
seek to identify the characteristics of design or construction that have given them a life
span far in excess of many twentieth-century products, then two important considerations
arise.

Firstly, the reliability of a structure or assembly will be influenced by its complexity.
The fewer sub-assemblies and the fewer types of material and component involved, then
the greater is the likelihood of an inherently reliable product. The modern equipment and
products which we condemn as unreliable often comprise thousands of piece parts,
involving many different materials all of which interact within various tolerances. Telford
and Brunel’s structures, on the other hand, are less complex, comprising fewer types of
material with relatively few well-proven modules.

Secondly, we should consider the two most common methods of achieving reliability.
They are:

• Duplication  The use of additional, redundant, parts whose individual failure does not
cause the overall product to fail.

• Excess Strength  Deliberate design to withstand stresses higher than are anticipated.
Small increases in strength for a given anticipated stress result in substantial decreases
in failure rate. This applies equally to mechanical and electrical items.

Although effective, both are costly methods of achieving high reliability and long life.
The nineteenth-century engineers may have been less prone to material cost constraints,
or to the difficulties of equipment complexity, compared with today’s designers and that
may account for many of the successes of that age. No doubt many ventures did involve
new materials and methods and had to be implemented under severe cost restraints. Perhaps
they are the ones which have not survived to complete the comparison.

The purpose of the foregoing remarks is to point out that reliability is a ‘built in’ feature
of any construction, be it mechanical, structural or electrical, and that it can be increased
by design effort or by the addition of material. It is clear that the cost of such enhancement
must be offset by at least the equivalent saving in maintenance in order to justify it.

Design activities which influence the achieved reliability are:

• Specifying and allocating the requirement. 

• Derating components by using higher stress ratings. 

• Environmental stress protection. 

• Minimising the complexity. 

• Choice of parts. 

• Use of redundancy. 

• Choice of technology. 

• Failure mode analysis. 

• Qualification testing. 


Figure 7

[Graph axes: failure rate ($10^{-9}$ per h); applied voltage / rated voltage; ambient temperature]


Specifying and Allocating the Requirement 

This is often not given the attention it requires; this results in unrealistic requirements in 
some areas and overdesign in others. Simple reliability predictions (one significant 
figure) during the early design decisions provide information as to which elements in a 
configuration will contribute the major proportion of the failures. The reliability plan can 
therefore be focussed on the most critical areas. 

Derating 

The principle of operating a component part below the rated stress level of a parameter 
in order to obtain a longer or more reliable life is well known. It is of particular 
interest in electronics where under-rating of voltage and temperature produces spectacular 
improvements in reliability. Stresses can be divided into two broad categories:
environmental and operating.

Operating stresses are present when a device is active. Examples are voltage, current, 
self-generated temperature and self-induced vibration. These have a marked effect on the 
frequency of random failures as well as hastening wear out. Figure 7 shows the relationship 
of failure rate to voltage and temperature stress for a typical wet aluminium capacitor. 

Environmental Stress Protection 

Environmental stress hastens the onset of wear out by contributing to physical
deterioration. Factors included are given in Table 4.


TABLE 4

Stress | Symptom | Action
High temperature | Insulation materials deteriorate. Chemical reactions accelerate | Dissipate heat. Minimise thermal contact. Use fins. Increase conductor sizes on PCBs. Provide conduction paths
Low temperature | Mechanical contraction damage. Insulation materials deteriorate | Apply heat and thermal insulation
Thermal shock | Mechanical damage within LSI components | Shielding
Mechanical shock | Component and connector damage | Mechanical design. Use of mountings
Vibration | Hastens wear out and causes connector failure | Mechanical design
Humidity | Coupled with temperature cycling causes ‘pumping’ (filling up with water) | Sealing. Use of silica gel
Salt atmosphere | Corrosion and insulation degradation | Mechanical protection
Electromagnetic radiation | Interference to electrical signals | Shielding and part selection
Dust | Long-term degradation of insulation. Increased contact resistance | Sealing. Self-cleaning contacts
Biological effects | Decayed insulation material | Mechanical and chemical protection
Acoustic noise | Electrical interference due to microphonic effects | Mechanical buffers
Reactive gases | Corrosion of contacts | Physical seals


Supplement to Br. Telecommun. Eng., Vol. 7, Oct. 1988



Complexity 

It is a general rule that adding components to a system reduces the reliability. For 
electronic equipment integration levels should be chosen which permit the minimum 
component count. The same principle applies to mechanical items where one should aim 
for the minimum number of parts and bearing surfaces. 
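The effect of component count can be illustrated with the series model treated later in ‘Reliability Assessment and Modelling’: if every part must work, the part failure rates add. The following is a minimal sketch only; the function name and the per-part failure rate of 0.1 per million hours are hypothetical and are not data from this text.

```python
def series_failure_rate(part_rates):
    """Failure rate of a series system: the sum of its part failure rates."""
    return sum(part_rates)


# Hypothetical parts, each with a failure rate of 0.1 per million hours.
for count in (10, 100, 1000):
    rate_pmh = series_failure_rate([0.1] * count)
    mtbf_hours = 1e6 / rate_pmh  # constant failure rate assumed
    print(f"{count:5d} parts: {rate_pmh:6.1f} per million hours, MTBF {mtbf_hours:,.0f} h")
```

Each tenfold increase in part count reduces the MTBF tenfold under these assumptions, which is the argument for choosing integration levels that minimise the component count.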


Part Selection 

The quality level of components has a direct effect on failure rate. This is due to the effect 
of screening out populations of inherent defectives. 


Redundancy 

Redundancy may be applied in a number of ways. The effect of this on reliability is 
discussed in the section entitled ‘Reliability Assessment and Modelling’. It should be 
remembered, however, that redundancy as a means of reliability improvement does carry 
a number of penalties: 

• Space 

• Weight 

• Cost 

• More failures at the component level 

• Common mode failures (see ‘Reliability Assessment and Modelling’) 

• Diminishing returns (see ‘Reliability Assessment and Modelling’) 

The decision to use redundancy must be based on an analysis of the trade-offs involved.
It may prove the only available method when other techniques have been exhausted. Beyond
the penalties listed above, the increase in the number of parts results in an increase in
maintenance and spares holding costs. In general, the reliability gain obtained from each
additional element decreases beyond a few duplicated elements owing to the series
reliability of switching or other devices needed to implement the particular configuration
employed.
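The diminishing return can be sketched numerically. The example below assumes active parallel redundancy with independent unit failures (no common mode failures) and a changeover device in series; the unit reliability of 0.90 and the switch reliability of 0.995 are hypothetical figures chosen only for illustration.

```python
def parallel_reliability(unit_reliability, units, switch_reliability=1.0):
    """Reliability of identical parallel units in series with a switching device.

    Assumes independent failures; the system fails only if every unit fails
    or the switching device fails.
    """
    all_units_fail = (1.0 - unit_reliability) ** units
    return switch_reliability * (1.0 - all_units_fail)


# Hypothetical figures: each unit 0.90 reliable, switching device 0.995.
previous = None
for n in range(1, 6):
    r = parallel_reliability(0.90, n, switch_reliability=0.995)
    gain = "" if previous is None else f"  gain {r - previous:+.5f}"
    print(f"{n} unit(s): R = {r:.5f}{gain}")
    previous = r
```

With these figures each added unit contributes roughly a tenth of the gain of the one before it, and the result can never exceed the reliability of the series switching device.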


Choice of Technology 

The type of component technology chosen has a major effect on system reliability. It is
commonly assumed that micro-electronic designs are automatically more reliable than
the alternatives. This is not necessarily the case since there is a temptation to provide
many more facilities because of the sophistication of the processor. These additional
facilities lead to additional interfaces and a higher component count.


Failure Mode and Effect Analysis 

It is a mistake to regard failure mode analysis (FMA) and fault tree analysis (see 
‘Reliability Assessment and Modelling’) as being solely for the purpose of assessing a 
numerical value of reliability. In practice, the real benefit is derived from identifying the 
main causes of failure and feeding back the information by way of design modifications. 
It is the relative frequency of failures which is the more useful information rather than 
the absolute total. 


Qualification Testing 

Qualification testing is an engineering responsibility, not a quality control function. The
qualification test plan should aim to include tests to verify all the parameters (and their 
limits) in the requirements specification. In the case of the reliability parameter, it may 
not be possible to accumulate sufficient time to prove a particular failure rate. In that 
case, the test may continue into field use. Items to be verified include: 

• Function: Specified performance at defined limits and margins.

• Environment: Ambient temperature and humidity for use, storage, etc. Performance
at the extremes of the specified environment should be included.

• Life: At specified performance levels and under storage conditions.

• Reliability: Observed MTBF under all conditions.

• Maintainability: MTTR for defined test equipment, spares, manual and staff.

• Maintenance: Is the routine and corrective maintenance requirement compatible with
use?

• Packaging and Transport: Test under real conditions including shock tests.

• Physical Characteristics: Size, weight, power consumption, etc.

• Ergonomics: Consider interface with operators and maintenance personnel.

• Testability: Consider test equipment and time required for production models.

• Safety: Use an approved test house such as BSI or the British Electrotechnical
Approvals Board.






ACHIEVING RELIABILITY IN USE 


The techniques described in the previous section (‘Achieving Reliability in Design’) 
influence the basic design reliability of an item of equipment. Figure 8 emphasises that 
the achieved reliability will inevitably be less owing to a number of factors in manufacture 
and use. 


Figure 8 



The main activities for minimising these additional factors are: 

• burn-in and screening, 

• preventive maintenance, 

• quality control and test, and 

• reliability growth. 


Burn-in and Screening 

For an established design, the early failures portion of the bathtub curve is due to 
populations of items having inherent weaknesses due to minute variations and defects in 
the manufacturing process. Furthermore, it is increasingly held that electronic failures— 
even in the constant failure rate part of the curve—are due to microscopic defects in the 
physical build of the item. The effects of physical and chemical processes with time cause 
failures to occur both in the early failures and constant failure rate portions of the bathtub 
curve. Burn-in and screening are thus means of enhancing component reliability. 

Burn-in is the process of operating items at elevated stress levels (particularly 
temperature, humidity and voltage) in order to accelerate the processes leading to failure. 
The populations of defective items are thus reduced. 

Screening is an enhancement of the quality control process whereby additional detailed
visual and electrical/mechanical tests seek to reveal defective features which would
otherwise increase the population of ‘weak’ items. Screening methods include:

• internal visual inspection, 

• X-ray inspection, 

• high temperature, 

• temperature cycling, 

• thermal shock, 

• acceleration, 

• mechanical shock, 

• vibration, 

• voltage, and 

• electrical cycling. 

The relationship between various defined levels of burn-in and screening and the eventual 
failure rate levels is recognised and has, in the case of electronic components, become 
formalised. For micro-electronic devices, US MIL STD 883 provides a uniform set of 
test, screening and burn-in procedures. These include tests for moisture resistance, high 
temperature, shock, dimensions, electrical load and so on. The effect is to eliminate the 
defective items mentioned above. The tests are graded into three classes in order to take
account of the need for different reliability requirements at appropriate cost levels. These
levels are: 

Class C: The least stringent, which requires 100% internal visual inspection. There
are electrical tests at 25°C but no burn-in.

Class B: In addition to the requirements of Class C, there is 160 hours of burn-in
at 125°C and electrical tests at temperature extremes (high and low).

Class S: In addition to the tests in Class B, there is longer burn-in (240 hours) and
more severe tests including 72 hours reverse bias at 150°C.

The overall standardisation and quality assurance programmes described in US-MIL-
M-38510 call for the MIL 883 test procedures. The UK counterpart to this system of
controls is British Standard BS 9000, which functions as a four-tier hierarchy of
specifications from the general requirements at the top, through generic requirements, to
detailed component manufacture and test details at the bottom. Approximate equivalents for
the screening levels are given in Table 5.


TABLE 5

MIL 883 | BS 9400 | Relative Cost (approximate)
S | A | 10
B | B | 5
C | C | 3
- | D | 1
- | - | 0.5 (plastic)


Preventive Maintenance 


Routine 

This consists of cleaning, adjusting and lubricating on a planned basis. It is difficult to 
quantify the effects of this type of activity except over a long period. Opinion is divided 
as to its value. Examples are often quoted where more failures are caused than prevented. 
A common-sense approach should be taken in defining these routines and they should be 
evaluated on early models in the field. 


Preventive Replacement 

This requires a knowledge of the wear-out distribution of a particular item in the 
equipment. A policy of replacing items at a particular ‘age’ even if they have not failed 
is adopted. This is applicable only to the wear-out situation, and Figure 9 gives the example 
of the failure distribution of a filament lamp. In this example, the failures are distributed 
according to the Gaussian (normal) law. Other examples may exhibit more complex 
distributions but the principle of selecting a time for automatic replacement based on the 
probability of failure is the same. In this example, if lamps are replaced at 1000 hours, 
then 50% will fail before replacement. If it is decided to replace them earlier, at a time 
one standard deviation before the mean, then approximately 15.9% will fail before
replacement. Some potential useful life is thus traded against the inconvenience and cost 
of a failure. 
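The trade-off in the lamp example can be reproduced with the normal distribution in the Python standard library. This is a sketch under stated assumptions: the mean life of 1000 hours comes from the example, but the standard deviation of 100 hours is hypothetical (the text does not give one), as are the function names.

```python
from statistics import NormalDist

MEAN_LIFE_H = 1000.0  # mean life used in the lamp example
SIGMA_H = 100.0       # hypothetical standard deviation, not given in the text

lamp_life = NormalDist(mu=MEAN_LIFE_H, sigma=SIGMA_H)


def fraction_failed_before(replace_at_h):
    """Fraction of lamps expected to fail before the planned replacement time."""
    return lamp_life.cdf(replace_at_h)


def replacement_time_for_risk(risk):
    """Replacement time at which the given fraction of lamps will already have failed."""
    return lamp_life.inv_cdf(risk)


print(f"Replace at mean life:      {fraction_failed_before(MEAN_LIFE_H):.1%} fail first")
print(f"Replace one sigma earlier: {fraction_failed_before(MEAN_LIFE_H - SIGMA_H):.1%} fail first")
print(f"Time for a 5% risk:        {replacement_time_for_risk(0.05):.0f} h")
```

The first two results do not depend on the assumed standard deviation; only the replacement time for a chosen risk does.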

In the next section (‘Data Interpretation and Test’), the treatment of wear-out
distributions describes how to quantify the appropriate replacement time for a given risk.
It is important to establish that an item actually has an increasing failure rate before
deciding to apply preventive replacement.



Figure 9 


Dormant Failures 

Many faults do not cause immediate system failure and hence cause no harm. Only a 
further fault or combination of stress conditions reveals this condition; nevertheless, its 
existence may well accelerate a system failure. A failed component in a circuit which, 
although not causing malfunction, places greater stresses on those remaining is one 
example. Another is a failed alarm circuit, being particularly undesirable since it may 
conceal a further failure until damage has resulted. Routine checks for dormant faults 
are expensive but may well be justified as part of the routine maintenance procedure for 
key items. 


Drift Conditions (Degradation) 

Measurements of key parameters aimed at detecting whether they are drifting towards 
the limit would result in preventive replacement and enhanced reliability. This is an 
expensive procedure and is likely to be justified only in a few cases. 












Repair and Redundancy 

Periodic inspection and early repair of unattended redundant units also enhances reliability. 
MTBF is thus dependent on repair times as well as unit reliability. See ‘Reliability 
Assessment and Modelling’ section. 

Quality Control and Test 

The early failures of the bathtub curve are associated both with manufacturing processes 
(soldering, wrapping, assembly) and with marginally acceptable components. Both types 
of failure will be reduced by the application of quality control, particularly if inspection 
and test are fed with failure data derived from field information. The procedures and 
methods involved in a quality assurance system would easily occupy another volume and 
there is already much excellent literature on the subject. Field feedback is an activity 
specifically related to reliability and maintainability. 

Reliability Growth 

This is a means of reliability enhancement which arises from the continual feedback of 
field failure information. The more formal the programme the faster the growth. 

One method of plotting the growth is the use of CUSUM (cumulative sum chart) plots. 
In this technique, an anticipated target MTBF is chosen and the deviations are plotted 
against time. The effect is to show the MTBF by the slope of the plot, which is more 
sensitive to changes in reliability. 

The example in Table 6 shows the number of failures after each 100 hours of running 
of a generator. The CUSUM is plotted in Figure 10. 


TABLE 6

Cumulative Hours | Failures | Anticipated Failures if MTBF were 200 hours | Deviation | CUSUM
100 | 1 | 0.5 | +0.5 | +0.5
200 | 1 | 0.5 | +0.5 | +1
300 | 2 | 0.5 | +1.5 | +2.5
400 | 1 | 0.5 | +0.5 | +3
500 | 0 | 0.5 | -0.5 | +2.5
600 | 1 | 0.5 | +0.5 | +3
700 | 0 | 0.5 | -0.5 | +2.5
800 | 0 | 0.5 | -0.5 | +2
900 | 0 | 0.5 | -0.5 | +1.5
1000 | 0 | 0.5 | -0.5 | +1


Figure 10 
CUSUM plot 



The CUSUM was plotted for an objective MTBF of 200 hours. It shows that for the 
first 400 hours the MTBF trend was in the order of half the requirement. From 400 to 
600 hours, there was an improvement to about 200 hours MTBF, and thereafter there is 
evidence of reliability growth. The plot is sensitive to the changes in trend as can be seen 
from the above. 

The reader will notice that the axis of the deviation has been inverted so that negative 
variations produce an upward trend. This is often done in reliability CUSUM work in 
order to reflect improving MTBFs by an upward curve, and vice versa.
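The CUSUM column of Table 6 can be recomputed in a few lines. The sketch below uses the failure counts from the table and the 200 hour target MTBF; the function and variable names are illustrative only.

```python
def cusum(failures_per_interval, interval_hours, target_mtbf_hours):
    """Running sum of (observed - anticipated) failures for equal test intervals."""
    anticipated = interval_hours / target_mtbf_hours
    running, series = 0.0, []
    for observed in failures_per_interval:
        running += observed - anticipated
        series.append(running)
    return series


# Failures observed in each successive 100 hours of running (Table 6).
failures = [1, 1, 2, 1, 0, 1, 0, 0, 0, 0]

for hours, value in zip(range(100, 1100, 100), cusum(failures, 100, 200)):
    print(f"{hours:5d} h  CUSUM {value:+.1f}")
```

A rising CUSUM indicates an MTBF worse than the target and a falling CUSUM an MTBF better than the target; the slope, not the absolute value, carries the information.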



DATA INTERPRETATION AND TEST 


The bathtub curve, Figure 5, illustrated that there are essentially three types of failure 
distribution. Two of these involve variable failure rate and the third is the simplest case 
of constant failure rate (that is to say random failures). 


Constant Failure Rate 


Random failures can be described by a single parameter, namely failure rate. Returning
to Figure 3, it was noted that the point estimate of failure rate is given by:

$$\lambda = k/T.$$

When $k$ is large (say 50) the error inherent in this approach is small because repeated
trials are likely to involve only small variations of $\lambda$. Varying values of $k$, from say 48 to
52, involve only small changes of $\lambda$ having regard to the tolerances which apply to that
parameter. If, on the other hand, $k$ is a very small number (even zero) the situation is
quite different. Repeated trials will yield differing values of $k$ and the uncertainty in $\lambda$ is
unacceptable.

It is thus necessary to make use of statistical inference tools in order to interpret such 
data. At this stage, the concept of ‘confidence’ is required. This is best illustrated by 
means of an example. Figure 11 shows a distribution of heights of a group of people in 
histogram form. Superimposed onto the histogram is a curve of the normal distribution. 

In the figure, there is a good fit between the normal curve, having a mean of 70 inches
and a standard deviation (measure of spread) of 1 inch, and the heights of the group in
question. Consider, now, a person drawn, at random, from the group. It is permissible to
predict, from a knowledge of the normal distribution, that the person will be 70 inches
tall or more providing that it is stated that the prediction is made with 50% confidence.
This really means that we anticipate being correct 50% of the time if we repeat the
experiment. On this basis, an infinite number of possible heights can be stated, providing
that an appropriate confidence level accompanies each value. For example:

71 inches or more at 15.9% confidence

72 inches or more at 2.3% confidence

73 inches or more at 0.1% confidence



The inferred measurement and the confidence level can, hence, be traded off against 
each other. 
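The confidence levels quoted for the heights example are upper-tail areas of the normal curve and can be checked directly. The sketch below assumes the stated mean of 70 inches and standard deviation of 1 inch; the function name is illustrative.

```python
from statistics import NormalDist

# Normal curve fitted to the heights in the example: mean 70 in, sigma 1 in.
heights = NormalDist(mu=70.0, sigma=1.0)


def confidence_at_least(height_in):
    """Probability that a person drawn at random is this tall or taller."""
    return 1.0 - heights.cdf(height_in)


for h in (70, 71, 72, 73):
    print(f"{h} inches or more: {confidence_at_least(h):.1%} confidence")
```

Raising the claimed height lowers the confidence with which the claim can be made, which is the same trade that is applied to failure rate in what follows.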

By applying this process to the failure rate problem, we can make use of the fact that,
for time truncated failure data:

$$\lambda_{CL} = \frac{\chi^2}{2T},$$

where $\lambda_{CL}$ is the upper limit of failure rate at the confidence stated, and
$\chi^2$ is the value of $\chi^2$ for $n = 2(k + 1)$ degrees of freedom at $(1 - CL) = \alpha$.
This is best understood by means of an example.


Example:

A replacement test involving 50 devices is run for 100 hours and then truncated. Calculate
the MTBF (single-sided lower limit) at 60% confidence:

(a) if there are two failures, and

(b) if there are zero failures.

Hence, $T = 50 \times 100 = 5000$ hours.

$CL = 0.6$ (confidence level).

Thus $1 - CL = 0.4 = \alpha$.

(Note $\alpha$ in the table of percentage points of the $\chi^2$ distribution given in Table 7.)

The values of $n$ are thus:

(a) $n = 2(k + 1) = 6$, where $k = 2$.

(b) $n = 2(k + 1) = 2$, where $k = 0$.

Thus,

(a) $n = 6$, $\alpha = 0.4$, $\chi^2 = 6.21$.

(b) $n = 2$, $\alpha = 0.4$, $\chi^2 = 1.83$.

The failure rates are thus:

(a) $\lambda_{60\%} = 6.21/(2 \times 5000) = 621 \times 10^{-6}$ per hour.

(b) $\lambda_{60\%} = 1.83/(2 \times 5000) = 183 \times 10^{-6}$ per hour.

The corresponding lower limits of MTBF, $2T/\chi^2$, are approximately 1610 hours and
5460 hours respectively.

The reader may care to recalculate the above failure rates at 90% confidence (that is,
$\alpha = 0.1$). The result will be that for greater confidence a larger failure rate must be
assumed. Failure rate (or for that matter MTBF) can be traded against confidence for
any given data.
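The worked example can also be checked without tables. The sketch below relies on a standard closed form for the upper-tail probability of $\chi^2$ when the number of degrees of freedom is even (always the case here, since $n$ is $2k$ or $2(k + 1)$) and inverts it by bisection. The function names are illustrative, not part of any library.

```python
import math


def chi2_upper_tail(x, n):
    """P(chi-square with even n degrees of freedom exceeds x), closed form."""
    if n <= 0 or n % 2:
        raise ValueError("n must be a positive even integer")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, n // 2):
        term *= half / i          # builds (x/2)^i / i!
        total += term
    return math.exp(-half) * total


def chi2_point(n, alpha):
    """Value of chi-square whose upper-tail probability is alpha (bisection)."""
    lo, hi = 0.0, 1.0
    while chi2_upper_tail(hi, n) > alpha:
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_upper_tail(mid, n) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def failure_rate_upper_limit(k, total_hours, confidence, time_truncated=True):
    """Upper limit of failure rate: chi-square / 2T at the stated confidence."""
    n = 2 * (k + 1) if time_truncated else 2 * k
    return chi2_point(n, 1.0 - confidence) / (2.0 * total_hours)


# The example above: 50 devices for 100 hours, 60% confidence.
for k in (2, 0):
    lam = failure_rate_upper_limit(k, 50 * 100, 0.60)
    print(f"k = {k}: failure rate {lam * 1e6:.0f} x 10^-6 per hour, MTBF {1 / lam:.0f} h")
```

Changing the confidence argument to 0.90 gives the larger failure rates that the reader is invited to calculate.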















TABLE 7

Percentage Points of the $\chi^2$ Distribution

n | α = 0.9995 | 0.999 | 0.995 | 0.990 | 0.975 | 0.95 | 0.90 | 0.80 | 0.70 | 0.60 | 0.50
2 | 0.00100 | 0.00200 | 0.0100 | 0.0201 | 0.0506 | 0.103 | 0.211 | 0.446 | 0.713 | 1.02 | 1.39
4 | 0.0639 | 0.0908 | 0.207 | 0.297 | 0.484 | 0.711 | 1.06 | 1.65 | 2.19 | 2.75 | 3.36
6 | 0.299 | 0.381 | 0.676 | 0.872 | 1.24 | 1.64 | 2.20 | 3.07 | 3.83 | 4.57 | 5.35
8 | 0.710 | 0.857 | 1.34 | 1.65 | 2.18 | 2.73 | 3.49 | 4.59 | 5.53 | 6.42 | 7.34
10 | 1.26 | 1.48 | 2.16 | 2.56 | 3.25 | 3.94 | 4.87 | 6.18 | 7.27 | 8.30 | 9.34
12 | 1.93 | 2.21 | 3.07 | 3.57 | 4.40 | 5.23 | 6.30 | 7.81 | 9.03 | 10.2 | 11.3
14 | 2.70 | 3.04 | 4.07 | 4.66 | 5.63 | 6.57 | 7.79 | 9.47 | 10.8 | 12.1 | 13.3
16 | 3.54 | 3.94 | 5.14 | 5.81 | 6.91 | 7.96 | 9.31 | 11.2 | 12.6 | 14.0 | 15.3
18 | 4.44 | 4.90 | 6.26 | 7.01 | 8.23 | 9.39 | 10.9 | 12.9 | 14.4 | 15.9 | 17.3
20 | 5.40 | 5.92 | 7.43 | 8.26 | 9.59 | 10.9 | 12.4 | 14.6 | 16.3 | 17.8 | 19.3
22 | 6.40 | 6.98 | 8.64 | 9.54 | 11.0 | 12.3 | 14.0 | 16.3 | 18.1 | 19.7 | 21.3
24 | 7.45 | 8.08 | 9.98 | 10.9 | 12.4 | 13.8 | 15.7 | 18.1 | 19.9 | 21.7 | 23.3
26 | 8.54 | 9.22 | 11.2 | 12.2 | 13.8 | 15.4 | 17.3 | 19.8 | 21.8 | 23.6 | 25.3
28 | 9.66 | 10.4 | 12.5 | 13.6 | 15.3 | 16.9 | 18.9 | 21.6 | 23.6 | 25.5 | 27.3
30 | 10.8 | 11.6 | 13.8 | 15.0 | 16.8 | 18.5 | 20.6 | 23.4 | 25.5 | 27.4 | 29.3
32 | 12.0 | 12.8 | 15.1 | 16.4 | 18.3 | 20.1 | 22.3 | 25.1 | 27.4 | 29.4 | 31.3
34 | 13.2 | 14.1 | 16.5 | 17.8 | 19.8 | 21.7 | 24.0 | 26.9 | 29.2 | 31.3 | 33.3
36 | 14.4 | 15.3 | 17.9 | 19.2 | 21.3 | 23.3 | 25.6 | 28.7 | 31.1 | 33.3 | 35.3
38 | 15.6 | 16.6 | 19.3 | 20.7 | 22.9 | 24.9 | 27.3 | 30.5 | 33.0 | 35.2 | 37.3
40 | 16.9 | 17.9 | 20.7 | 22.2 | 24.4 | 26.5 | 29.1 | 32.3 | 34.9 | 37.1 | 39.3

n | α = 0.40 | 0.30 | 0.20 | 0.10 | 0.05 | 0.025 | 0.01 | 0.005 | 0.001 | 0.0005
2 | 1.83 | 2.41 | 3.22 | 4.61 | 5.99 | 7.38 | 9.21 | 10.6 | 13.8 | 15.2
4 | 4.04 | 4.88 | 5.99 | 7.78 | 9.49 | 11.1 | 13.3 | 14.9 | 18.5 | 20.0
6 | 6.21 | 7.23 | 8.56 | 10.6 | 12.6 | 14.4 | 16.8 | 18.5 | 22.5 | 24.1
8 | 8.35 | 9.52 | 11.0 | 13.4 | 15.5 | 17.5 | 20.1 | 22.0 | 26.1 | 27.9
10 | 10.5 | 11.8 | 13.4 | 16.0 | 18.3 | 20.5 | 23.2 | 25.2 | 29.6 | 31.4
12 | 12.6 | 14.0 | 15.8 | 18.5 | 21.0 | 23.3 | 26.2 | 28.3 | 32.9 | 34.8
14 | 14.7 | 16.2 | 18.2 | 21.1 | 23.7 | 26.1 | 29.1 | 31.3 | 36.1 | 38.1
16 | 16.8 | 18.4 | 20.5 | 23.5 | 26.3 | 28.8 | 32.0 | 34.3 | 39.3 | 41.3
18 | 18.9 | 20.6 | 22.8 | 26.0 | 28.9 | 31.5 | 34.8 | 37.2 | 42.3 | 44.4
20 | 21.0 | 22.8 | 25.0 | 28.4 | 31.4 | 34.2 | 37.6 | 40.0 | 45.3 | 47.5
22 | 23.0 | 24.9 | 27.3 | 30.8 | 33.9 | 36.8 | 40.3 | 42.8 | 48.3 | 50.5
24 | 25.1 | 27.1 | 29.6 | 33.2 | 36.4 | 39.4 | 43.0 | 45.6 | 51.2 | 53.5
26 | 27.2 | 29.2 | 31.8 | 35.6 | 38.9 | 41.9 | 45.6 | 48.3 | 54.1 | 56.4
28 | 29.2 | 31.4 | 34.0 | 37.9 | 41.3 | 44.5 | 48.3 | 51.0 | 56.9 | 59.3
30 | 31.3 | 33.5 | 36.3 | 40.3 | 43.8 | 47.0 | 50.9 | 53.7 | 59.7 | 62.2
32 | 33.4 | 35.7 | 38.5 | 42.6 | 46.2 | 49.5 | 53.5 | 56.3 | 62.5 | 65.0
34 | 35.4 | 37.8 | 40.7 | 44.9 | 48.6 | 52.0 | 56.1 | 59.0 | 65.2 | 67.8
36 | 37.5 | 39.9 | 42.9 | 47.2 | 51.0 | 54.4 | 58.6 | 61.6 | 68.0 | 70.6
38 | 39.6 | 42.0 | 45.1 | 49.5 | 53.4 | 56.9 | 61.2 | 64.2 | 70.7 | 73.4
40 | 41.6 | 44.2 | 47.3 | 51.8 | 55.8 | 59.3 | 63.7 | 66.8 | 73.4 | 76.1

One further fact needs to be learned. If the data is failure truncated, that is to say
terminated at the occurrence of a particular failure rather than at a specific time, then
$n = 2k$ instead of $2(k + 1)$.

Where failure rate is calculated in this way, it is that failure rate or less that we expect
at the degree of confidence stated. Where MTBF is calculated, it is that MTBF or greater
that is anticipated.

The following list of steps summarises the use of the $\chi^2$ tables for interpreting the
results of reliability tests or data.

(1) Measure $T$ (accumulated test hours) and $k$ (number of failures).

(2) Select a confidence level and let $\alpha = (1 - \text{confidence level})$.

(3) Let $n = 2k$ ($2k + 2$ for lower limit MTBF in time-truncated test).

(4) Note the value of $\chi^2$ from tables.

(5) Let MTBF at the given confidence level be $2T/\chi^2$ or $\lambda$ be $\chi^2/2T$.


Variable Failure Rate 

The above $\chi^2$ method does not cover the infant mortality or wear-out conditions. There
is one distribution, however, which is equally applicable to infant mortality, constant 
failure rate and wear out. This so called Weibull distribution (named after Waloddi 
Weibull) is extremely flexible and versatile. It is simple to use if appropriate Weibull scale 
paper is available and its application is now described. 

When analysing failure data, one usually attempts to describe the pattern of variation 
in terms of a known distribution. Often the ‘normal’ distribution (for wear-out phase) or 
the exponential (for prime of life phase) is either assumed or taken to check for goodness 
of fit. The Weibull distribution is of wider applicability than these more conventional ones 
and may be used either: 

(a) as a second line of attack when they do not fit, or 

(b) for general application in the first instance. Particular solutions of the Weibull fit 
for the normal, exponential and other cases will now be described. 

As distinct from the exponential distribution which has one parameter (failure rate)
and the normal distribution which is totally described by two parameters (mean life, and
standard deviation), the Weibull distribution has three parameters. $\beta$ is a shape parameter
which can take the value of any positive real number. This is a very powerful parameter
which gives the Weibull distribution its versatility in representing any of a number of
distribution forms. For example:

$\beta < 1.0$: decreasing failure rate with time.

$\beta = 1.0$: constant failure rate.

$\beta > 1.0$: increasing failure rate.

$\beta = 2.0$: failure rate increasing at a linear rate.

$\beta > 2.0$: failure rate increasing at a rate equal to the power $\beta - 1$.

The other two parameters are:

$\eta$, the scale parameter, and

$\gamma$, the location parameter.











Their relationship is described by: 


$$R(t) = \exp\left\{-\left(\frac{t - \gamma}{\eta}\right)^{\beta}\right\}.$$

This can be reduced to a straight line equation provided loglog by log graph paper is 
used. Once again, an example is used to explain this form of probability plotting. 

Ten devices were put on test and permitted to fail without replacement. The time at 
which each device failed was noted and from the test information the following must be 
determined: 

(a) if there is a Weibull distribution which fits these data,

(b) if so, the values of $\gamma$, $\eta$ and $\beta$,

(c) the probability of items surviving for specified lengths of time,

(d) if the failure rate is increasing, decreasing or constant, and

(e) the MTBF.

The results are shown in Table 8 against the median ranks for sample size 10. The ten 
points are plotted on the Weibull paper (Figure 12) and a straight line is obtained. 


TABLE 8

Cumulative failures, $Q_t$ (%) median rank | 6.7 | 16.2 | 25.9 | 35.6 | 45.2 | 54.8 | 64.5 | 74.1 | 83.8 | 93.3
Time, $t$ (hours × 100) | 1.7 | 3.5 | 5.0 | 6.4 | 8.0 | 9.6 | 11 | 13 | 18 | 22



Figure 12 


Figure 12 is loglog by log graph paper with suitable scales for cumulative percentage
failure and time. Cumulative percentage failure is effectively the unreliability and is
estimated by taking each failure in turn from median ranking tables of the appropriate
sample size. It should be noted that the sample size is the number of items on test, which
in this case happens to equal the number of failures observed. A test yielding 10 failures
from 25 items would require the first 10 terms of the median ranking table for sample
size 25.

The straight line tells us that the Weibull distribution is applicable and the parameters 
are determined as follows: 

$\beta$: The slope yields the value of $\beta$, which is obtained by taking a line parallel to the data
line but through the origin of the construction in Figure 12. The value of $\beta$ is shown
by the intersection with the arc. Here $\beta = 1.5$.

$\eta$: $\eta$ is obtained by taking a horizontal line from the origin of the construction across to
the data line and then reading the corresponding value of $t$. Here $\eta = 1110$ hours.

$\gamma$: Since the data yields a straight line, $\gamma = 0$ and the Weibull expression reduces to a
2-parameter distribution. More complex cases are treated in Reference 1.
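Where Weibull paper is not to hand, the same straight line can be fitted numerically. With $\gamma = 0$, taking logarithms twice gives $\ln(-\ln(1 - Q)) = \beta \ln t - \beta \ln \eta$, so an ordinary least-squares line through the Table 8 points yields $\beta$ from the slope and $\eta$ from the intercept. This is a sketch of a substitute technique, not the graphical method described above, and a least-squares fit need not reproduce values read by eye from the plot exactly. The function name is illustrative.

```python
import math

# Table 8: median ranks (%) and times to failure converted to hours.
median_rank_pct = [6.7, 16.2, 25.9, 35.6, 45.2, 54.8, 64.5, 74.1, 83.8, 93.3]
time_h = [170, 350, 500, 640, 800, 960, 1100, 1300, 1800, 2200]


def fit_weibull_2p(times, ranks_pct):
    """Least-squares fit of a 2-parameter Weibull on linearised axes.

    Returns (beta, eta), assuming the location parameter gamma is zero.
    """
    xs = [math.log(t) for t in times]
    ys = [math.log(-math.log(1.0 - q / 100.0)) for q in ranks_pct]
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    beta = sxy / sxx                         # slope of the fitted line
    eta = math.exp(x_bar - y_bar / beta)     # time at which the line crosses y = 0
    return beta, eta


beta, eta = fit_weibull_2p(time_h, median_rank_pct)
print(f"beta = {beta:.2f}, eta = {eta:.0f} hours")
```

The fitted values come out close to, but not identical with, the $\beta = 1.5$ and $\eta = 1110$ hours read from the graph.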

The reliability expression is therefore: 


$$R(t) = \exp\left\{-\left(\frac{t}{1110}\right)^{1.5}\right\}.$$



The probability of survival to $t = 1000$ hours is therefore:

$$R(1000) = e^{-0.855} = 42.5\%.$$

The test shows a wear-out situation since the shape parameter $\beta$ is greater than 1.

It now remains to evaluate the MTBF. This was seen in the ‘Interrelationship of
Parameters’ section to be the integral from zero to infinity of $R(t)$. Table 9 enables this
step to be short-cut.

Since $\beta = 1.5$, then MTBF$/\eta = 0.903$ and MTBF $= 0.903 \times 1110 = 1002$ hours.
Since median rank tables have been used, the MTBF and reliability values calculated are
at the 50% confidence level. In the example, time was recorded in hours but there is no
reason why a more appropriate scale should not be used such as number of operations or
cycles. The MTBF would then be quoted as ‘mean number of cycles between failures’.
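Both results can be checked numerically. For the Weibull distribution the integral of $R(t)$ has the known closed form $\gamma + \eta\,\Gamma(1 + 1/\beta)$, which is the quantity that a table of MTBF$/\eta$ against $\beta$ tabulates. The sketch below uses the parameters of the example; the function names are illustrative.

```python
import math


def weibull_reliability(t, eta, beta, gamma=0.0):
    """R(t) = exp(-((t - gamma) / eta) ** beta), valid for t >= gamma."""
    return math.exp(-(((t - gamma) / eta) ** beta))


def weibull_mtbf(eta, beta, gamma=0.0):
    """Mean life: the integral of R(t), which equals gamma + eta * Gamma(1 + 1/beta)."""
    return gamma + eta * math.gamma(1.0 + 1.0 / beta)


ETA_H, BETA = 1110.0, 1.5  # parameters of the worked example

print(f"R(1000 h) = {weibull_reliability(1000.0, ETA_H, BETA):.1%}")
print(f"MTBF/eta  = {math.gamma(1.0 + 1.0 / BETA):.3f}")
print(f"MTBF      = {weibull_mtbf(ETA_H, BETA):.0f} hours")
```

For $\beta = 1$ the gamma function term is 1 and the MTBF equals $\eta$, the constant failure rate case.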

TABLE 9

$\beta$ | MTBF/$\eta$ | $\beta$ | MTBF/$\eta$ | $\beta$ | MTBF/$\eta$ | $\beta$ | MTBF/$\eta$
0.0 | ∞ | 1.0 | 1.000 | 2.0 | 0.886 | 3.0 | 0.894
0.1 | 10! | 1.1 | 0.965 | 2.1 | 0.886 | 3.1 | 0.894
0.2 | 5! | 1.2 | 0.941 | 2.2 | 0.886 | 3.2 | 0.896
0.3 | 9.261 | 1.3 | 0.923 | 2.3 | 0.886 | 3.3 | 0.897
0.4 | 3.323 | 1.4 | 0.911 | 2.4 | 0.886 | 3.4 | 0.898
0.5 | 2.000 | 1.5 | 0.903 | 2.5 | 0.887 | 3.5 | 0.900
0.6 | 1.505 | 1.6 | 0.897 | 2.6 | 0.888 | 3.6 | 0.901
0.7 | 1.266 | 1.7 | 0.892 | 2.7 | 0.889 | 3.7 | 0.902
0.8 | 1.133 | 1.8 | 0.889 | 2.8 | 0.890 | 3.8 | 0.904
0.9 | 1.052 | 1.9 | 0.887 | 2.9 | 0.892 | 3.9 | 0.905
 | | | | | | 4.0 | 0.906


For examples of other than 10 items, a set of median ranking tables is required. Since
space does not permit a full set to be included, the following approximation is given. For
sample size $N$, the $r$th rank is obtained from:

$$\frac{r - 0.3}{N + 0.4}$$
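The approximation is easily evaluated. The sketch below applies it to a sample size of 10 so that the results can be compared with the tabulated median ranks in Table 8; the function name is illustrative.

```python
def median_rank(r, n):
    """Approximate median rank (as a fraction) of the r-th failure among n items."""
    return (r - 0.3) / (n + 0.4)


# Sample size 10, for comparison with the tabulated ranks in Table 8.
for r in range(1, 11):
    print(f"r = {r:2d}: {100 * median_rank(r, 10):.1f}%")
```

The approximate values agree with the tabulated ranks to within about 0.2 percentage points, which is well inside the accuracy of a hand-drawn plot.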

The dangers of attempting to construct a Weibull plot with too few points should be
noted. A satisfactory result will not be obtained with fewer than six points. Tests
yielding 0, 1, 2 and even 3 failures do not enable any changing failure rate to be observed.
In these cases, constant failure rate must be assumed and the $\chi^2$ test used. This is valid
provided that the information extracted is applied only to the same time range as the test.

Care must be taken in the choice of the appropriate ranking table. $N$ is the number of
items in the test and $r$ is the number that failed, in other words the number of data points.
In the example, $N$ was 10 not because the number of failures was 10 but because it was
the sample size. As it happens the case where all 10 failed was considered.


Continuous Processes 

If a particular component or equipment has a wear-out characteristic and its Weibull 
parameters have been obtained, then it is possible to calculate the probability of failure 
for any value of time. A time for preventive replacement can be chosen against a given 
probability that the device will fail beforehand. 

In some cases, a population of variable failure rate devices, such as lamps, may be 
regarded as failing at random. Imagine a building using a large number of lamps where 
each is replaced as it fails so that, after several times the mean life has elapsed, the 
population will be made up of lamps at different points in their life. Figure 13 shows the 
superimposition of successive generations as a result of which failures are occurring at 
random. 



Figure 13


In this way, a continuous process with renewal of items allows a wear-out mechanism 
to appear as if there were constant failure rate. A Weibull plot of data obtained in this 
way would not be valid. 

Consider the case of a system exhibiting software failures. If the times between each 
new failure are recorded, then a list of times to failure will be obtained. It is tempting to 
plot these on Weibull paper but, as can be seen from the above, this would be incorrect. 
The Weibull process applies only to a specific population of items and the time to failure 
of each item. 


RELIABILITY ASSESSMENT AND MODELLING 


Why Carry Out Reliability Predictions? 

Reliability prediction is the process of calculating the anticipated system reliability from 
assumed component failure rates. It provides a quantitative measure of how close a design 
comes to meeting the design objectives and permits comparisons between different 
proposals to be made. It has already been emphasised that reliability prediction is an 
imprecise calculation, but it is nevertheless a valuable exercise for the following reasons: 

• It provides an early indication of a system’s potential to meet the design reliability 
requirements. 

• It enables an assessment of life cycle costs to be carried out. 

• It enables one to establish which components, or areas, in a design contribute to the 
major portion of the unreliability. 

• It enables trade-offs to be made, for example, between reliability and maintainability 
in achieving a given availability. 

• Because the contract calls for it. 

There are three basic approaches to system assessment and these are:

• Parts Count which involves a simple addition of component failure rates with little 
allowance for the stress levels involved. This is quick and easy to perform and provides a 
worst-case reliability prediction. Although specific failure modes are not addressed, an 
indication of the overall maintenance commitment is provided. 

• Failure Mode Analysis involving a stress analysis of each component. 

• Fault Tree Analysis. 

Failure mode and effect analysis (FMEA) is a systematic process whereby faults at the 
part/component level are identified and, by using failure rates for the appropriate stress 
levels, their effect at the system level is determined. Each part is considered, in turn, as 
having failed in each possible mode. The effect of each of these failures at various system 
levels is noted and a failure rate assigned from available data. Each system level failure 
mode will be seen to result from various possible component failures and these can be 
grouped together for the purpose of calculating the system failure rate. It is usual to 
consider only primary failure mode effects on system performance. In some cases, 
secondary effects and their consequences are also considered. 

Table 10 shows a typical failure mode analysis worksheet for carrying out FMEAs. 
The results of the failure mode analysis provide failure rate totals for each block in a 
system and these are then applied to reliability block diagrams. See Figure 14. The 
calculation of system MTBF or reliability is then a question of applying the appropriate 
prediction mathematics. 


TABLE 10

PART: PCB 12

                                          Failure Mode 1        Failure Mode 2
                    Stress   Failure      Loss of output        Spurious output
Component           Ratio    Rate (PMH)   Mode      λ           Mode      λ

R23                 0.5      0.001        o/c       0.001       s/c       Negl.
TR1                 -        0.1          o/c       0.003       N/A       -
Relay R6
  coil                       0.05         N/A       -           o/c       0.05
  contact                    0.15         s/c       0.015       o/c       0.135
IC8, Quad Op Amp             0.1          50%       0.05        25%       0.025

Totals                       5.75                   2.55                  1.45
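The grouping step of the worksheet can be sketched in a few lines of Python. This is an illustration only: the rows are those legible in Table 10, the names `WORKSHEET` and `mode_totals` are hypothetical, and because the printed table shows only some of the components on PCB 12, the sums below are not expected to reproduce the totals row.

```python
# Rows legible in Table 10: (component, {system failure mode: failure rate, PMH}).
# Worksheet entries marked "Negl." or "N/A" are simply left out.
WORKSHEET = [
    ("R23", {"loss of output": 0.001}),
    ("TR1", {"loss of output": 0.003}),
    ("Relay R6 coil", {"spurious output": 0.05}),
    ("Relay R6 contact", {"loss of output": 0.015, "spurious output": 0.135}),
    ("IC8", {"loss of output": 0.05, "spurious output": 0.025}),
]

def mode_totals(rows):
    """Sum the component failure rates that contribute to each system failure mode."""
    totals = {}
    for _component, modes in rows:
        for mode, rate in modes.items():
            totals[mode] = totals.get(mode, 0.0) + rate
    return totals

totals = mode_totals(WORKSHEET)
for mode, rate in totals.items():
    print(f"{mode}: {rate:.3f} PMH")
```

Each total is the failure rate that would be carried forward to the corresponding block of a reliability block diagram.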



Supplement to Br. Telecommun. Eng., Vol. 7, Oct. 1988 














Figure 14: Reliability block diagram, spurious output (diagram not reproduced; the only surviving block label is PCB20)



Fault tree analysis involves choosing system level faults and then constructing a logic diagram showing
all the possible combinations of failures and conditions which lead to them. Failure mode
probabilities are then computed from basic fault data. This method is considered most
applicable during design and pre-production stages of a product. The result is a logic tree
consisting of AND and OR gates. In the ‘Terms and Jargon’ section, the concept is applied
to the simple two-diode example.

Figure 15 shows a more complex tree. The system failure being modelled is known as
the top event and each individual path from top to bottom is known as a CUTSET. By
using probability mathematics similar to that used in block diagram modelling, the system
MTBF or reliability can be calculated if the bottom event failure rates and repair (down)
times are known.
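As a rough sketch of how gate probabilities combine, the following Python fragment evaluates a small tree. It assumes the bottom events are independent; the tree itself, the event names and the probability values are hypothetical and are not taken from Figure 15.

```python
from math import prod

def and_gate(probabilities):
    """All inputs must occur: the product of the input probabilities."""
    return prod(probabilities)

def or_gate(probabilities):
    """Any one input is enough: 1 minus the product of the complements."""
    return 1.0 - prod(1.0 - p for p in probabilities)

# Hypothetical bottom-event probabilities (illustrative values only).
p_pump_a, p_pump_b, p_power = 0.01, 0.02, 0.001

# Top event: power fails OR (pump A fails AND pump B fails).
p_top = or_gate([p_power, and_gate([p_pump_a, p_pump_b])])
print(f"P(top event) = {p_top:.7f}")
```

Real fault tree programs also handle repeated events and time-dependent failure and repair data, which this sketch does not attempt.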


Figure 15: Fault tree (diagram not reproduced)



In most cases, the tree will be fairly complex and presents a problem in manual analysis.
Fortunately, there are a large number of fault tree analysis computer programs available
and most reliability consultants and several bureaux have access to them. Although the
calculation is longwinded, the run time is comparatively modest and the cost of a fault
tree run can be as little as a few tens of pounds. Most programs, as well as offering the
system MTBF or reliability, will rank the CUTSETS in order of MTBF or
unavailability, thus presenting the engineer with a valuable description
of the design in terms of the key contributors to system failure.

Fault tree analysis is not suitable for sequential events where the top event is affected 
by the order. A technique known as cause consequence analysis is finding increasing use. 

Table 11 briefly compares the foregoing methods. 


TABLE 11

Parts Count
    Advantages:    Quick.
    Disadvantages: Does not address modes, hence it only estimates overall
                   maintenance rate.

FMEA and Reliability Block Diagram
    Advantages:    Addresses every component failure mode.
    Disadvantages: Difficult to apply to very complex systems where simple
                   RBDs are inadequate.

Fault Tree Analysis
    Advantages:    Flexible and addresses complex systems. Human error rates
                   can be included. Many computer aids available.
    Disadvantages: Top down and thus open-ended; hence it is possible to miss
                   some configurations which would cause system failure.

Simulation
    Advantages:    Caters for very complex systems with any distribution of
                   failure and repair rates.
    Disadvantages: Can be expensive, particularly with high reliability
                   systems for which computer simulation is lengthy.


Modelling

The following probability rules are necessary for an understanding of reliability modelling.

Multiplication Rule

If two or more events can occur simultaneously, and their individual probabilities of
occurring are known, then the probability of simultaneous events is the product of the
individual probabilities. The shaded area in Figure 16 represents the probability of events
A and B occurring simultaneously.

Hence the probability of A and B occurring is:

$$P_{ab} = P_a \times P_b.$$

Generally,

$$P_{an} = P_a \times P_b \times \cdots \times P_n.$$

Figure 16: Overlapping circles representing events A and B (diagram not reproduced)

Addition Rule

It is also required to calculate the probability of either event A or event B or both
occurring. This is the area of the two circles in Figure 16. This probability is:

$$P_{(a\,\mathrm{or}\,b)} = P_a + P_b - P_a P_b,$$

being the sum of $P_a$ and $P_b$ less the area $P_a P_b$, which is included twice. This becomes:

$$P_{(a\,\mathrm{or}\,b)} = 1 - (1 - P_a)(1 - P_b).$$

Hence, the probability of one or more of n events occurring is:

$$P_{(a\,\mathrm{or}\,b\,\ldots\,\mathrm{or}\,n)} = 1 - (1 - P_a)(1 - P_b) \cdots (1 - P_n).$$
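The step between the two forms of the addition rule for two events is a simple expansion, written out here for completeness (nothing beyond the symbols already defined is assumed):

$$1 - (1 - P_a)(1 - P_b) = 1 - (1 - P_a - P_b + P_a P_b) = P_a + P_b - P_a P_b.$$

The product form is the more convenient one because it extends directly to any number of events.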



Series Systems 

Consider the two diodes connected in series which were described earlier. One of the 
failure modes discussed was open circuit which occurs if either diode fails open. This 
situation, where any failure causes the system to fail, is known as series reliability, and a 
series block diagram is used. It so happens that, for this failure mode, the physical series 
and the reliability series diagrams coincide. When the short-circuit case is considered 
later, although the diodes are still in series, the reliability block diagram changes to 
parallel blocks. 

For open circuit then, the reliability of the system is the probability that diode A does 
not fail and diode B does not fail. 

From the multiplication rule,

$$R_{ab} = R_a \times R_b,$$

and, in general,

$$R_{an} = R_a \times R_b \times \cdots \times R_n.$$

In the constant failure rate case where:

$$R_a = e^{-\lambda_a t},$$

then

$$R_{an} = \exp\{-(\lambda_a + \lambda_b + \cdots + \lambda_n)t\}.$$

From which it can be seen that the system is also a constant failure rate unit whose
reliability is of the form $e^{-Kt}$, where K is the sum of the individual failure rates. Provided
that the two assumptions of constant failure rate and series modelling apply, then it is
valid to talk of a system failure rate computed from the sum of the individual unit or
component failure rates.

The practice of adding up the failure rates in a component count type prediction is thus 
assuming that any single failure causes a system failure. It is therefore a worst-case 
prediction since, clearly, a failure mode analysis against a specific failure mode will involve 
only those components which contribute to that top event. 


Returning to the example of the two diodes, assume that each has a failure rate of
$7 \times 10^{-6}$ per hour for the fail open mode and consider the reliability for one year. One year
has 8760 hours.

From the above:

$$\lambda_{system} = \lambda_a + \lambda_b = 14 \times 10^{-6} \text{ per hour}.$$

$$\lambda t = 8760 \times 14 \times 10^{-6} = 0.1226.$$

$$R_{system} = e^{-\lambda t} = 0.885.$$
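The same arithmetic can be checked with a short Python sketch. The function name `series_reliability` is illustrative; the failure rates and the one-year period are those of the example above.

```python
import math

HOURS_PER_YEAR = 8760

def series_reliability(failure_rates, hours):
    """Reliability of a series system of constant-failure-rate units."""
    return math.exp(-sum(failure_rates) * hours)

# Two diodes, each 7e-6 per hour for the fail-open mode.
r_system = series_reliability([7e-6, 7e-6], HOURS_PER_YEAR)
print(f"R_system over one year = {r_system:.3f}")
```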


Redundancy 

Continuing with the example, consider the short-circuit failure mode. This is no longer a
reliability series situation since both diodes need to fail short circuit in order for the top event to
occur. In this case, a parallel reliability block diagram is used. Since either, or both, diodes
operating correctly is sufficient for system success, then:

$$R_{system} = 1 - (1 - R_a)(1 - R_b).$$

In other words, 1 minus the product of their unreliabilities. Let us assume that for short
circuit the failure rate is $3 \times 10^{-6}$ per hour.

$$R_a = R_b = e^{-\lambda t}, \quad \text{where } \lambda t = 3 \times 10^{-6} \times 8760 = 0.026.$$

$$e^{-\lambda t} = 0.974.$$

$$R_{system} = 1 - (0.026)^2 = 0.999.$$

If there were N items in this redundant configuration such that all may fail except one,
then the expression becomes

$$R_{system} = 1 - (1 - R_a)(1 - R_b) \cdots (1 - R_n).$$

There is a pitfall at this point which it is important to emphasise. The reliability of the
two-unit system, after substitution of $R = e^{-\lambda t}$, becomes:

$$R_s = 2e^{-\lambda t} - e^{-2\lambda t}.$$

It is very important to note that, unlike the series case, this combination of constant
failure rate units exhibits a reliability characteristic which is NOT of the form $e^{-Kt}$. In
other words, although constant failure rate units are involved, the failure rate of the
system is variable. The MTBF can therefore be obtained only from the integral of
reliability.


Hence,

$$\mathrm{MTBF} = \int_0^\infty R(t)\,dt = \int_0^\infty \left(2e^{-\lambda t} - e^{-2\lambda t}\right)dt = \frac{2}{\lambda} - \frac{1}{2\lambda} = \frac{3}{2\lambda} = \frac{3\theta}{2},$$

where $\theta$ is the MTBF of a single unit.

The danger now is to assume that the failure rate of the system is $2\lambda/3$. This is not
true since the practice of inverting MTBF to obtain failure rate, and vice versa, is valid
only for constant failure rate. Within the above working, $\theta$ was substituted for $1/\lambda$, but
in that case a unit was being considered for which constant $\lambda$ applies.
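A numerical check of this result is sketched below in Python. It assumes two identical units without repair, integrates R(t) with a plain trapezoidal rule (any numerical integrator would serve), and also evaluates the instantaneous failure rate to show that it is not constant. The function names are illustrative.

```python
import math

def parallel_reliability(lam, t, n=2):
    """Reliability of n identical units in full active redundancy, no repair."""
    return 1.0 - (1.0 - math.exp(-lam * t)) ** n

def mtbf_by_integration(lam, n=2, steps=200_000):
    """Trapezoidal integral of R(t) from 0 to 40/lam; the tail beyond is negligible."""
    upper = 40.0 / lam
    h = upper / steps
    total = 0.5 * (parallel_reliability(lam, 0.0, n) + parallel_reliability(lam, upper, n))
    total += sum(parallel_reliability(lam, i * h, n) for i in range(1, steps))
    return total * h

def hazard(lam, t):
    """Instantaneous failure rate -R'(t)/R(t) of the two-unit parallel system."""
    e1, e2 = math.exp(-lam * t), math.exp(-2.0 * lam * t)
    return 2.0 * lam * (e1 - e2) / (2.0 * e1 - e2)

lam = 3e-6  # per hour, short-circuit mode (figure from the text)
r_year = parallel_reliability(lam, 8760)
mtbf = mtbf_by_integration(lam)
print(f"R over one year       = {r_year:.4f}")
print(f"MTBF / theta          = {mtbf * lam:.4f}")
print(f"hazard at t = 0       = {hazard(lam, 0.0):.2e} per hour")
print(f"hazard at t = 20/lam  = {hazard(lam, 20.0 / lam):.2e} per hour")
```

The hazard starts at zero and rises towards the single-unit rate, which is why inverting the MTBF does not give a meaningful system failure rate here.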

Other forms of redundancy involve partial, conditional and stand-by systems. Similar 
forms of equations apply and may be found in the literature. Figure 17 gives a general 
view. 


Figure 17: Redundancy (diagram not reproduced)



Systems with Repair 

By repairing the failed units in a redundant system, a spectacular increase in reliability
can be achieved, since the system is ‘vulnerable’ only during the down time of the failed
redundant unit. There is therefore a relationship between reliability and down time, and
hence with MTTR. In the following equations the repair rate ($\mu$) is the reciprocal of down
time, and $\theta$ denotes the system MTBF.


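(a) Active redundancy: two identical units.

$$\theta = \frac{3\lambda + \mu}{2\lambda^2} \approx \frac{\mu}{2\lambda^2} \quad \text{if } \mu \gg \lambda.$$

(b) Full active redundancy: n identical units.

$$\theta = \frac{\mu^{n-1}}{n\lambda^n} \quad \text{if } \mu \gg \lambda.$$

(c) Stand-by redundancy: two identical units.

$$\theta = \frac{2\lambda + \mu}{\lambda^2} \approx \frac{\mu}{\lambda^2} \quad \text{if } \mu \gg \lambda.$$

(d) Stand-by redundancy: n identical units.

$$\theta = \frac{\mu^{n-1}}{\lambda^n} \quad \text{if } \mu \gg \lambda.$$

(e) Active redundancy: two different units.

$$\theta = \frac{(\lambda_a + \mu_b)(\lambda_b + \mu_a) + \lambda_a(\lambda_a + \mu_b) + \lambda_b(\lambda_b + \mu_a)}{\lambda_a \lambda_b (\lambda_a + \lambda_b + \mu_a + \mu_b)}.$$

(f) Partial active redundancy: see Table 12.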


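TABLE 12

System MTBF Table

Total number   Number of units
of units       required to operate   System MTBF

1              1                     $1/\lambda$

2              1                     $(3\lambda + \mu)/(2\lambda^2)$
2              2                     $1/(2\lambda)$

3              1                     $(11\lambda^2 + 7\lambda\mu + 2\mu^2)/(6\lambda^3)$
3              2                     $(5\lambda + \mu)/(6\lambda^2)$
3              3                     $1/(3\lambda)$

4              1                     $(25\lambda^3 + 23\lambda^2\mu + 13\lambda\mu^2 + 3\mu^3)/(12\lambda^4)$
4              2                     $(13\lambda^2 + 5\lambda\mu + \mu^2)/(12\lambda^3)$
4              3                     $(7\lambda + \mu)/(12\lambda^2)$
4              4                     $1/(4\lambda)$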
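The entries of the system MTBF table can be reproduced from a simple first-passage recursion, sketched here in Python. The sketch assumes that the units are identical and active, and that every failed unit is repaired independently at the same rate; that assumption is not stated in the text but is the one under which the recursion agrees with the tabulated expressions. The function name `system_mtbf` is illustrative, and exact fractions are used so that the comparison is not blurred by rounding.

```python
from fractions import Fraction

def system_mtbf(n, r, lam, mu):
    """MTBF of n identical active units of which r are required to operate.

    D_k is the mean time to go from k failed units to k + 1 for the first time:
        D_k = (1 + k*mu*D_(k-1)) / ((n - k)*lam)
    The system fails when n - r + 1 units are down, so the MTBF from the
    all-working state is D_0 + D_1 + ... + D_(n-r).
    """
    total, d_prev = 0, 0
    for k in range(n - r + 1):
        d_k = (1 + k * mu * d_prev) / ((n - k) * lam)
        total += d_k
        d_prev = d_k
    return total

lam, mu = Fraction(1, 1000), Fraction(1, 10)  # illustrative rates per hour

# Compare with two closed-form entries of the table.
one_of_two = (3 * lam + mu) / (2 * lam**2)
two_of_three = (5 * lam + mu) / (6 * lam**2)
print(system_mtbf(2, 1, lam, mu) == one_of_two)
print(system_mtbf(3, 2, lam, mu) == two_of_three)
```

The same function can be called with floating-point rates when only numerical values are wanted.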


If $\mu \gg \lambda$, then constant failure rate may be assumed and Table 12 can be expressed as
shown in Table 13. If unavailability is required then, provided that the system MDT is the same as
for individual units, the approximation that unavailability is $\lambda \times \mathrm{MDT}$ yields Table 14.


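TABLE 13

System Failure Rates

Total number   Number of units
of units       required to operate   System failure rate

1              1                     $\lambda$

2              1                     $2\lambda^2\,\mathrm{MDT}$
2              2                     $2\lambda$

3              1                     $3\lambda^3\,\mathrm{MDT}^2$
3              2                     $6\lambda^2\,\mathrm{MDT}$
3              3                     $3\lambda$

4              1                     $4\lambda^4\,\mathrm{MDT}^3$
4              2                     $12\lambda^3\,\mathrm{MDT}^2$
4              3                     $12\lambda^2\,\mathrm{MDT}$
4              4                     $4\lambda$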


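TABLE 14

System Unavailability

Total number   Number of units
of units       required to operate   System unavailability

1              1                     $\lambda\,\mathrm{MDT}$

2              1                     $2\lambda^2\,\mathrm{MDT}^2$
2              2                     $2\lambda\,\mathrm{MDT}$

3              1                     $3\lambda^3\,\mathrm{MDT}^3$
3              2                     $6\lambda^2\,\mathrm{MDT}^2$
3              3                     $3\lambda\,\mathrm{MDT}$

4              1                     $4\lambda^4\,\mathrm{MDT}^4$
4              2                     $12\lambda^3\,\mathrm{MDT}^3$
4              3                     $12\lambda^2\,\mathrm{MDT}^2$
4              4                     $4\lambda\,\mathrm{MDT}$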
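The entries of the failure rate and unavailability tables follow one pattern, which the Python sketch below captures. The general coefficient n!/((r-1)!(n-r)!) is an inference from the tabulated entries for up to four units, not a formula given in the text, and the function names are illustrative. The unavailability follows the convention stated above, namely the system failure rate multiplied by the unit MDT.

```python
from math import factorial

def approx_failure_rate(n, r, lam, mdt):
    """Approximate failure rate of n units with r required (valid for lam * mdt << 1)."""
    k = n - r + 1  # number of unit failures needed to fail the system
    coeff = factorial(n) // (factorial(r - 1) * factorial(n - r))
    return coeff * lam**k * mdt**(k - 1)

def approx_unavailability(n, r, lam, mdt):
    """System failure rate multiplied by the unit mean down time."""
    return approx_failure_rate(n, r, lam, mdt) * mdt

lam, mdt = 1e-4, 10.0  # illustrative: failures per hour, and hours
print(f"2-out-of-3 failure rate:   {approx_failure_rate(3, 2, lam, mdt):.1e} per hour")
print(f"2-out-of-3 unavailability: {approx_unavailability(3, 2, lam, mdt):.1e}")
```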



Common Cause Failure 

Consider a triplicated system as illustrated in Figure 18. The redundant configuration is
such that 2 out of the 3 are required to operate. The failure rate of the configuration is
$6\lambda^2\,\mathrm{MDT}$. Typical figures of $10^{-4}$ per hour for failure rate and 10 hours mean down time
for each unit are used, giving a system failure rate of $6 \times 10^{-7}$ per hour. Consider, however, that
only one failure in 100 is of such a nature as to affect all three channels and to render
the redundancy ineffective. It is necessary then to add on a series element whose failure
rate is

$$1\% \times 10^{-4} = 10^{-6} \text{ per hour}.$$

The effect is to swamp the redundant part of the prediction with a series element which is
itself less reliable than the redundant configuration ($10^{-6}$ against $6 \times 10^{-7}$ per hour).
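The arithmetic of this example is small enough to restate as a Python sketch. The function name is illustrative and the parameter `beta` stands for the common-cause fraction (1% here); the other figures are those of the example.

```python
def failure_rate_with_common_cause(lam, mdt, beta):
    """2-out-of-3 redundancy plus a series common-cause element of rate beta * lam."""
    redundant = 6 * lam**2 * mdt
    common_cause = beta * lam
    return redundant, common_cause, redundant + common_cause

redundant, common, total = failure_rate_with_common_cause(lam=1e-4, mdt=10.0, beta=0.01)
print(f"redundant part: {redundant:.1e} per hour")
print(f"common cause:   {common:.1e} per hour")
print(f"total:          {total:.1e} per hour")
```

Even at 1%, the common-cause element contributes more than half of the total.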

Typical causes of this type of failure arise from: 

• Design: Software failures affecting all three channels, common power supplies and
services, circuit elements common to the three functions but not foreseen during the
failure mode analysis.

• Manufacturing: Batch-related component or process deficiencies.

• Operating Procedure: Maintenance and human-induced failures.

• Environment: Lightning, electrical interference etc.

Typical defences against common-cause failures (also known as common-mode failures) 
are: 


• Operating Diversity: Using two or more methods of achieving the function.

• Equipment Diversity: Using two or more designs in the redundancy.

• Fail Safe Techniques: Including voting.

• Segregation of Channels: Both physical and electrical.

• Proof Testing: Periodic checks.

The degree to which common-cause failures are likely depends on the type of design. 
Redundant channels having a high degree of segregation and diversity and with little 
interconnection will have a low susceptibility to the problem. On the other hand, designs 
involving closely interconnected units and having shared power supplies may well be 
dominated by common-cause failures. 

Clearly, any prediction of this phenomenon must be highly subjective since, if it were
possible to identify the elements by means of a failure mode analysis, then they would be
allowed for in the prediction as series elements in the ordinary way. In essence, a prediction
of common-cause failures is an attempt to predict the unforeseeable. An estimate is
formed of the value $\beta$ (beta), which corresponds to the 1% used in the example. Typical
figures used are in the range 5% to 25% for redundancy involving identical channels, and
from 0.1% to 10% where diverse methods are employed.


Prediction in Perspective 

It must be stressed that prediction is a design tool and not a precise measure of reliability. 
The main value of a prediction is in showing the relative reliabilities of modules so that 
allocations can be made. Whatever the accuracy of the exercise, if one module is shown 
to have double the MTBF of another then, when calculating values for modules in order 
to achieve the desired system MTBF, the values allocated to the modules should be in the 
same ratio. Prediction also permits a reliability comparison between different design 
solutions. Again, the comparison is likely to be accurate even if the absolute values are 
not. The accuracy of the actual predicted values will depend on:

(a) relevance of the failure rate data and the chosen environmental multiplication factors,

(b) accuracy of the mathematical model, and

(c) the absence of gross over-stressing in operation.

The greater the number of different component types involved, the more likely that 
individual over- and under-estimates will cancel each other out. 


A reliability prediction is often necessary in response to a contract requirement or to 
an invitation to tender. It is advisable to have carried out predictions previously and a 
stock of subsystem failure rates based on appropriate failure mode analyses makes it 
possible to respond to such requirements with the minimum of effort. 

The chart in Figure 19 gives a more balanced view of reliability assessment by taking
into account the various factors explained in this paper. Notice that it refers to mission,
environmental and on/off profiles.

Figure 19: Predicting reliability (chart not reproduced)

In this section, it has been explained how to obtain the
appropriate model for each of the configurations and repair conditions which are likely
to occur. It is also necessary to apply these to each of the stress factors and obtain a
prediction for each. The MTBF or reliability over a period must then be calculated by
putting together the various proportions. If, for example, an equipment is used in an
environment which leads to a prediction of $100 \times 10^{-6}$ failures per hour for 20% of the
time, and is dormant, where $1 \times 10^{-6}$ per hour is assumed, for the rest of the time, then
the overall prediction would be:

$$20\% \times 100 \times 10^{-6} + 80\% \times 1 \times 10^{-6} = 20.8 \times 10^{-6} \text{ per hour}.$$
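The proportioning can be sketched in Python as a time-weighted sum. The function name and the list-of-pairs representation of the usage profile are illustrative choices; the figures are those of the example.

```python
def time_weighted_failure_rate(profile):
    """profile: list of (fraction_of_time, failure_rate_per_hour) pairs."""
    total_fraction = sum(fraction for fraction, _ in profile)
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("time fractions must sum to 1")
    return sum(fraction * rate for fraction, rate in profile)

# 20% of the time in use at 100e-6 per hour, 80% dormant at 1e-6 per hour.
overall = time_weighted_failure_rate([(0.20, 100e-6), (0.80, 1e-6)])
print(f"{overall * 1e6:.1f} x 10^-6 per hour")
```

Further profile segments, such as a second operating environment, are added as extra pairs.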


FAILURE RATE DATA SOURCES 

Failure rate data sources can be grouped in three categories: 

• published failure rate data, 

• data banks/bases which can be accessed, and 

• in-house data not available outside the particular organisation. 

Furthermore, there are three methods of presenting the data: 

• lists or tables of component failure rates, 

• regression models whereby failure rates can be obtained by specifying parameters (for 
example, temperature, load, packaging, a quality metric, some complexity metric etc.), 

• either of the foregoing, but including a breakdown of the failure rates into defined 
failure modes (for example, high leakage, open circuit, low output etc). 

The data sources which should be considered are described below. 


US Military Handbook 217E (1986) 

This is the best known of the data sources. It is currently at issue E.

It is produced by RADC (Rome Air Development Center) under contract to the
US DOD and is primarily an electronic failure rate book. It covers:

Micro-electronics
Discrete semiconductors
Tubes (thermionic)
Lasers
Resistors and capacitors
Inductors
Connections and connectors
Meters
Crystals
Lamps, fuses and other miscellaneous items

The micro-electronic sections present the information as a number of regression models. 
Passive components involve tables of failure rates and the use of multipliers to take 
account of quality and environment. 

The trend in successive issues of MIL 217 has been towards lower failure rates, 
particularly in the case of micro-electronics. This is also seen in other data banks and 
reflects the steady increase in manufacturing quality and screening techniques over the 
past 15 years. 


Nonelectronic Parts Reliability Data Book - NPRD3 (1985)

This document is also produced by RADC and is currently at issue 3. It was previously 
published as: 

NPRD 1 - 1970s

NPRD 2 - 1981

It comprises 367 pages of failure rates information for a wide range of electromechanical, 
mechanical, hydraulic and pneumatic parts. Failure rates are listed for a number of 
environmental applications. The main sections are: 

Nonelectronic Generic Failure Rates This provides the failure rate data against each 
component type. There are one or more entries per component type depending on the 
number of environmental applications for which a rate is available. 

Detailed Data Each piece of data is given with the number of failures and hours (or 
operations/cycles). Thus there are frequently multiple entries for a given component type. 

Failure Modes Details of the breakdown of failure modes are given. 

NPRD3 is also available on floppy disc from RADC. 


Handbook of Reliability Data for Electronic Components Used in Telecommunications
Systems - HRD4 (1986)

This document is produced by British Telecom’s Materials and Components Centre. It 
offers both failure rate lists and a regression model. The models are much simpler than 
those offered in MIL 217; and, for example, bipolar SRAMs are described by: 

‘Bipolar SRAMS and fusible link PROMS (Category 1.3.9) 

X* = (94/ t)B 5 !\ 

where t = year of manufacture, taking the year of technology maturity as the base year,
and

B = number of bits (memories).’ 


The model involves the time for which the technology (not the device) has been in 
manufacture. The tables in HRD4 assume 1986 manufacture. 

The failure rates obtained from this document are generally optimistic compared with 
other sources, often by as much as an order of magnitude. This is due to an extensive 
‘screening’ of data whereby failures which can be attributed to a specific cause are 
eliminated from the data once remedial action has been introduced into the manufacturing 
process. Considerable effort is also directed towards eliminating maintenance induced 
failures from the data. 

HRD4 is also available, from British Telecom, on disc. 

Recueil de Données de Fiabilité du CNET (1983)

This document is produced by the Centre National d'Études des Télécommunications
(CNET). It was previously issued in 1981 and the latest edition is 1983. It has a similar
structure to US MIL 217 in that it consists of regression models for the prediction of
component failure rates as well as generic tables.

The models involve a simple regression equation together with graphs and tables which
enable each parameter to be specified. This model is also stated as a parametric equation
in terms of voltage, temperature etc.

The French PTT uses the CNET data as its standard. 

Electronic Reliability Data—INSPEC/NCSR (1981) 

This book was published jointly by the Institution of Electrical Engineers and the National 
Centre of Systems Reliability (Warrington) in 1981. 

It is now somewhat dated, and its failure rates are pessimistic because of the growth in
semiconductor reliability over the last decade.

It consists of simple multiplicative models for semiconductor and passive electronic
components with tables from which to establish the multipliers according to the
environmental, temperature and other parameters.

Reliability and Maintainability in Perspective, David J. Smith, Macmillan, 1988

This is not original failure rate data and only in very few cases offers a single rate for any one
component. Its purpose is to show, for each component, the range of failure rate values
which could be expected if all the data sources were consulted. Where one point in the
range tends to predominate, this is indicated.

A table is included for micro-electronic devices showing the comparatively wide ranges 
which apply. Since the variation between sources is far greater than the variation to be 
expected from variations of complexity, temperature and so on, the author recommends 
the use of the table in place of the more complicated and allegedly ‘precise’ models. 

Three pages of failure mode percentages are included and were compiled from various 
published documents over a number of years. 


MAINTAINABILITY 


Factors Influencing Down Time 

The two main factors governing down time are equipment design and the maintenance 
philosophy. In general, it is active repair elements which are determined by the design 
and the passive elements which are governed by the maintenance philosophy. The designer 
must be aware of the maintenance environment and of the possible equipment failure 
modes. The designer must understand that production difficulties all too often become 
field problems since, if assembly is difficult, maintenance will be nigh impossible. Achieving 
acceptable repair times involves facilitating diagnosis and repair. The main design 
parameters are: 

• Access 

• Adjustment 

• Built-in test equipment 

• Circuit layout and hardware partitioning 

• Connections and connectors 

• Displays and indicators 

• Handling and ergonomics 

• Identification 

• Interchangeability 

• Least replaceable assembly 

• Mounting 

• Component part types 

• Redundancy 


• Safety 

• Software 

• Standardisation 

• Test points 

Each of the factors has an effect on various of the items in Figure 6. The detailed
consideration of each of the above features is covered in Reference 1.

The maintenance philosophy aspects include: 

• Organisation of maintenance (first line, second line, overhaul) 

• Maintenance procedures 

• Tools and test equipment 

• Personnel (skills, training, motivation) 

• Manuals 

• Spares provisioning 

• Logistics 

Again, there are many considerations applying to the above list and these are dealt 
with in the reference. 

Predicting Repair Times 

There are methods for predicting repair times; the best known is detailed in US MIL
Handbook 472, Procedure 3. It involves scoring the maintainability features of the
equipment against a checklist similar to the two above lists; the scores are then converted
into a mean active repair time.

In practice, the scoring is repeated for a number of sample faults and a suitable average, 
weighted according to the frequency of the faults, is calculated. 

The main benefit of this procedure, besides providing an estimate of repair times, is 
that it forces the designer to evaluate the status of each of the maintainability features 
listed. 
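The frequency-weighted average mentioned above can be sketched in Python as follows. The sample faults, their failure rates and their predicted repair times are hypothetical values chosen for illustration, and the conversion of checklist scores into repair times is not modelled.

```python
def weighted_mean_repair_time(sample_faults):
    """sample_faults: list of (failure_rate, predicted_repair_time_hours) pairs."""
    total_rate = sum(rate for rate, _ in sample_faults)
    return sum(rate * hours for rate, hours in sample_faults) / total_rate

# Hypothetical sample faults: (failure rate per million hours, repair time in hours).
samples = [(5.0, 0.5), (2.0, 1.5), (1.0, 4.0)]
mean_time = weighted_mean_repair_time(samples)
print(f"mean active repair time = {mean_time:.2f} h")
```

Weighting by failure rate means that a frequent, quickly repaired fault counts for more than a rare, lengthy one.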


SOFTWARE RELIABILITY AND QUALITY 


Terms 

The question arises as to how a software failure is defined. Unlike hardware failures, there 
is no physical change which causes a functioning unit to cease functioning. Software 
failures are in fact errors which, owing to the complexity of programs, do not always 
become evident immediately. Unlike the hardware ‘bathtub curve’, there is no wear-out 
feature since the population of bugs can only (save for modifications) decrease. Figure 20 
illustrates the concept of fault/error/failure. 


Figure 20: Fault, error and failure (diagram not reproduced; surviving labels: Random, Over-stress, Wear out, Human, Electrical, Data, Interference)

Faults may occur in both hardware and software. Software faults, often known as
bugs, will arise as the result of particular parts of the code being used for the first time
or because of corruption due to some outside influence.

The presence of a fault in a program does not necessarily result in either an error or 
failure. A long time may elapse before that portion of the code is used under the 
circumstances which lead to failure. 

A fault (bug) may lead to an error. This is the condition whereby the system is in an 
incorrect state. A data value or instruction is thus incorrect and only when that particular 
part of the code is executed is the error revealed. 

An error may propagate to become a failure if the system does not contain some error
recovery logic capable of dealing with and minimising the effect of the error.

A failure, be it hardware- or software-related, is the termination of the ability of an 
item to perform its specified function. 

Causes 

Faults are caused at all stages in the design process. There is evidence that the majority
of errors (over 60%) are committed during the requirements and the design phases. The
remainder (under 40%) occur during coding. That is not to say that coding is not part of the
design but it is only the final activity in a much larger process. The more complex the 
system, the more faults will be likely to stem from ambiguities and omissions in the 
specification stages. The major sources of fault are: 

(a) From the requirement specification

    Incorrect requirements due to:
        Model not a good fit to the physical situation.
        Incorrect document cross-references.
    Inconsistent or incompatible requirements:
        Two references give conflicting information.
        Conventions not consistent.
    Requirement unclear or illogical.
    Requirement omitted (for example, handling of invalid inputs).

(b) From the design

    Unstructured approach to the design breakdown (that is, detail is considered first).
    Use of a non-standard language.
    Lack of change control.
    Specification was misunderstood.

(c) From coding

    Semantic errors involving incorrect use of statements.
    Logical errors in translating the design into code.
    Detailed syntax errors which may have escaped detection by the compiler.
    Use of an incorrect condition at a branch.
    Poor data validation (for example, no default condition after a data input).
    Variables not initialised, or used incorrectly.
    Insufficient arithmetical accuracy.
    Insufficient range checks (for example, divide by zero).
    Type mismatch (for example, string used as a variable).
    Residual errors in compilers.
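Several of the coding faults listed under (c) can be guarded against with very little code. The Python sketch below is purely illustrative: the two functions, their names and the alarm codes are hypothetical and simply show a range check, a type check and a default condition after a data input.

```python
def mean_call_duration(total_seconds, call_count):
    """Average call duration with explicit validation of both inputs."""
    if not isinstance(call_count, int) or isinstance(call_count, bool):
        raise TypeError("call_count must be an integer")  # guards against type mismatch
    if call_count < 0 or total_seconds < 0:
        raise ValueError("inputs must be non-negative")    # range check
    if call_count == 0:
        return 0.0                                          # avoids divide by zero
    return total_seconds / call_count

def alarm_level(code):
    """Map an input code to an alarm level, with an explicit default condition."""
    levels = {"M": "minor", "J": "major", "C": "critical"}
    return levels.get(code, "unknown")                      # default after a data input

print(mean_call_duration(10, 4))
print(mean_call_duration(10, 0))
print(alarm_level("X"))
```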

The Problem 

The software quality problem is summarised in Figure 21, which highlights the three 
major areas. 

Requirements 

The user states what he/she perceives to be the requirements in free expression language. 
These may be ambiguous or may omit essential items. Nevertheless, it is against this 
requirements specification that the design proceeds and thus any errors in the requirements 
will be reflected in the design. Indeed, the user may not know precisely what is required 
but nevertheless will measure his/her use of the final system against his/her perceived 
requirement. 

The first area for development is the user requirements specification, which would
benefit from some form of verification, as would each of the succeeding levels of specification
in the design. This verification is hampered by the free expression of the natural language
used.

The Design 

Once the design has proceeded to the stage where code is produced, then the question of
validating that code arises. Currently, this is carried out by the design reviews which
involve code inspections and walkthroughs. These are open-ended reviews which certainly
identify a number of faults but they cannot address the correctness of the vast number
of permutations of paths in software.

Figure 21: The quality problem (diagram not reproduced)

Test 

Test methods range from tests of the basic coded modules, by using test harnesses, through 
the integrating of modules, to the ultimate functional tests carried out on the combined 
hardware and software. 

The Solution 

The likely solutions to these problems lie in the methods and techniques which are 
currently being developed and introduced into software design. 

Requirements 

Formal requirements languages are being developed which constrain the writer to a more 
formal high-level language often supplemented with graphical techniques. It thus becomes 
possible to verify the requirements by formal mathematical means. Such verification may 
well be automated in the future thus ensuring a better specification before actual design 
commences. 

Currently, such developments as VDM (Vienna Development Method), Z and OBJ 
address this area. 

The Design 

The validation of code, described above, can be carried out by suites of programs known 
as static analysers. They examine the code for a number of features involving syntax, 
semantics, data flow, algebraic relationships etc. Faults are detected and can be put right 
at this stage. Two such validators are currently available but more will follow. 

The two existing methods are MALPAS (available from Rex Thompson and Partners) 
and SPADE (available from Program Development Ltd.). 

Test 

The above two techniques, formal requirements languages and static analysis, will eventually
enable tests to be specified automatically and it is likely that such test generating
packages will become available as the above techniques are developed.


MANAGEMENT OF RELIABILITY 


A number of standards have been written which define the reliability programme 
requirements. These may be called for in a contract. 

Defence Standard 00-5
    Reliability and Maintainability of Land Service Material. This may be
    called for in UK MOD contracts.

Defence Standard 00-40
    Achievement of Reliability and Maintainability. This may be called for
    in UK MOD contracts.

US MIL STD 785A
    Reliability Programs for Systems and Equipment Development and
    Production. This is called for in US Defence contracts.

US MIL STD 470
    Maintainability Program Requirements. Sometimes called for in US and
    UK MOD contracts.

British Standard 5760
    Reliability of Systems, Equipments and Components. Called for in
    commercial contracts.

AQAP 1 and AQAP 13
    NATO quality assurance systems of which AQAP 13 widens the scope to
    software.

Defence Standard 00-16
    Guide to the Achievement of Quality in Software. Quality assurance
    requirements for software development.


BIBLIOGRAPHY 


General 

1 David J. Smith. Reliability and Maintainability in Perspective, 3rd Edition. Macmillan, 1988. 

2 Patrick D. T. O’Connor. Practical Reliability Engineering, 2nd Edition. Wiley, 1985.

Mechanical 

3 Collins, J. A. Failure of Materials in Mechanical Design. Wiley, 1981. 

4 Hertzberg, R. W. Deformation and Fracture in Engineering Materials. Wiley, 1976. 

Electronics 

5 Reliability Design Handbook (Reliability Analysis Center, USA). 

6 The Reliability Handbook (National Semiconductor Corp.). 

Software 

7 David J. Smith, and Kenneth B. Wood, Engineering Quality Software. Elsevier Applied 
Science, 1987. 

8 Myers, G. J. Software Reliability, Principles and Practice. Wiley, 1976. 

9 Myers, G. J. The Art of Software Testing. Wiley, 1979. 

Maintainability 

10 Goldman, A. S., and Slattery, T. B. Maintainability: A Major Element of System
Effectiveness. Wiley, 1964.

11 Blanchard, B. S., and Lowery, E. E. Maintainability Principles and Practice. McGraw Hill, 
1969. 

Standards and Guidelines 

12 UK Defence Standard 00-40. Achievement of Reliability and Maintainability. 

13 UK Defence Standard 00-41. MOD Practices and Procedures in Reliability and Maintainability. 

14 UK British Standard 5760. Reliability in Systems, Equipments and Components. 

15 US MIL STD 785. Reliability Program Requirements. 

16 US MIL STD 781. Reliability Demonstration. 

17 US MIL STD 1629A. Failure Mode Effect and Criticality Analyses.

18 US MIL HDBK 217D. Reliability Prediction for Electronic Systems. 

19 UK Defence Standard 00-16. Guide to The Achievement of Quality in Software. 

20 US MIL S 52779A. Software Quality Assurance Program Requirements.

21 UK Health and Safety Executive. Guidance on the use of Programmable Electronic Systems 
in Safety related Applications, 1987. 

22 US MIL STD 470. Maintainability Program Requirements. 

23 US MIL STD 471. Maintainability Demonstration. 

24 US MIL HDBK 472. Maintainability Prediction.

